Abstract

The  role  of  prior  knowledge  in  skill  acquisition  is  to  enable  the  learner  to  detect  and  to 
correct  errors.  Computational  mechanisms  that  carry  out  these  two  functions  are 
implemented  in  a  simulation  model  which  represents  prior  knowledge  in  constraints.  The 
model  learns  symbolic  skills  in  mathematics  and  science  by  noticing  and  correcting 
constraint  violations.  Results  from  simulation  runs  include  quantitative  predictions  about 
the  learning  curve  and  about  transfer  of  training.  Because  constraints  can  represent 
instructions  as  well  as  prior  knowledge,  the  model  also  simulates  one-on-one  tutoring.  The 
implications  for  the  design  of  instruction  include  a  detailed  specification  of  the  content  of 
effective  feedback  messages  for  intelligent  tutoring  systems. 
THE ROLE OF KNOWLEDGE IN LEARNING

Learning  and  knowledge  are  doubly  related.  On  the  one  hand, 
knowledge  is  the  outcome  of  learning.  On  the  other  hand,  knowledge  is 
one  of  the  inputs  into  the  learning  process.  New  skills  are  constructed 
within  the  context  provided  by  prior  knowledge.  This  is  no  less  true  of 
technical  domains  such  as  mathematics,  science,  and  engineering  than  of 
common  sense  domains  such  as  cooking  and  travel  planning. 

Cognitive scientists from Ebbinghaus (1964/1885) to VanLehn
(1982) have sought to escape the complexities of prior knowledge by
studying situations in which such knowledge plays a minimal role. This
simplification has paid off theoretically. Following the pioneering
papers by Anzai and Simon (1979) and by Anderson, Kline, and Beasley
(1979), several computational models of the acquisition of cognitive
skills in the absence of prior knowledge have been proposed (e.g.,
Anderson, 1983; Holland et al., 1986; Langley, 1987; Ohlsson, 1987a;
Rosenbloom, 1986; VanLehn, 1990). These models assume that procedural
knowledge forms a closed loop: Problem solving methods generate
problem solving steps which, in turn, generate the experiences from
which new problem solving methods are induced. Simulation models of
this kind constitute an important advance over the mathematical and
verbal learning theories of the past, but the learning mechanisms
proposed within this paradigm (chunking, composition, discrimination,
generalization, grammar induction, subgoaling, etc.) do not explain the
role of prior knowledge in learning. There is no point along the
method-step-method loop at which domain knowledge can impact the
learning process.

Empirical research on knowledge-based skill acquisition began with
Judd's (1908) study of the skill of throwing darts at underwater targets
with and without knowledge of the principle of refraction. Both he and
later Katona (1940) reported dramatic effects of knowledge about
underlying principles on skill acquisition. Kieras and Bovair (1984) also
found such an effect, but other recent studies have found weaker effects
or no effect (e.g., Gick & Holyoak, 1983; Smith & Goodman, 1984).
Educational researchers frequently report that instruction in the relevant
domain knowledge does not guarantee correct action (e.g., Resnick &
Omanson, 1987; Reif, 1987). On the other hand, inappropriate prior
knowledge (so-called misconceptions) is quite likely to interfere with
successful problem solving (Confrey, 1990). The empirical results
indicate that we do not yet understand how prior knowledge interacts with
skill acquisition well enough to ask the right experimental questions.

Theoretical analysis of the function of prior knowledge in skill
acquisition has hardly begun. Ohlsson (1987b) proposed a computer
model which explained how inferential knowledge about the domain
enables a learner to find a more efficient strategy for a task which he or
she already knows how to solve. The hypothesis behind this model was that
domain knowledge allows the learner to reason about possible
simplifications of his or her current strategy. The model simulated
speed-up of a simple reasoning strategy, but it threw no light on the role
of domain knowledge in the initial acquisition of that strategy.

The purpose of the work reported here is to explore the hypothesis
that the function of knowledge in initial skill acquisition is to enable the
learner to detect and correct errors. This hypothesis is embodied in a
running simulation model which uses prior knowledge to learn cognitive
skills from unguided practice. The theory predicts the negatively
accelerated practice curve observed in human learning, throws some new
light on the problem of transfer of training, and suggests an analysis of
tutoring with some very specific implications for the design of intelligent
tutoring systems.

Throughout this chapter, the terms "domain knowledge" and "prior
knowledge" refer to declarative knowledge, while the terms "cognitive
skill", "problem solving method", "decision rule", and "mental
procedures" refer to procedural knowledge. Both common sense and
philosophy have long distinguished between theory and practice, between
knowing that and knowing how, but the particular formulation of this
distinction used here is imported from Artificial Intelligence (Winograd,
1975).

Procedural knowledge is prescriptive and use-specific. To a first
approximation, it consists of associations between goals, situations, and
actions. Examples of procedural knowledge are place-value algorithms
for arithmetic, methods for electronic troubleshooting, explanatory
strategies in biology, and the procedure for constructing structural
formulas for organic molecules. Declarative knowledge, on the other hand,
is descriptive (as opposed to prescriptive) and use-independent. To a
first approximation, it consists of facts and principles. Examples of
declarative knowledge are the laws of the number system, the general
gas law, Darwin's theory of evolution, and the theory of the covalent
bond. The function of procedural knowledge is to control action; the
function of declarative knowledge is to provide generality. Intelligent
behavior requires both types of knowledge (Anderson, 1976;
Winograd, 1975).

If the two types of knowledge are distinct, how do they interact? In
particular, if declarative knowledge is use-independent and distinct from
procedures, then how does it influence action? The problem investigated
in the research program summarized in this chapter is how (previously
learned) declarative knowledge affects the construction of (new)
procedural knowledge.


A  FUNCTIONAL  THEORY  OF  SKILL  ACQUISITION 

Learning  happens  during  problem  solving;  to  learn  is  to  adapt  to  the 
structure  of  the  task  environment;  learning  is  triggered  by  contradictions 
between  the  outcomes  of  problem  solving  steps  and  prior  knowledge. 
These  three  principles  imply  a  particular  functional  breakdown  of  skill 
acquisition. 

Principle  1:  Learning  as  Problem  Solving 

During practice, the learner is faced with problems which he or she
does not yet know how to solve; that is why he or she is practicing.
Practice is problem solving and skill acquisition is the encoding of the
results of problem solving for future use. People solve unfamiliar
problems with so-called weak methods, i.e., problem solving methods
which are so general that they can be applied even with a minimum of
information about the task environment. The weak methods people have
been observed to use include analogical inference, hill climbing,
forward search, means-ends analysis, and planning.

Weak methods are general but inefficient. The function of weak
methods during practice is not to produce complete or correct problem
solutions, but to generate task relevant behavior. Activity vis-a-vis the
task provides the learner with the opportunity to discover the structure
of the task environment. Cognitive skills are constructed by interpreting,
storing, and indexing such discoveries so that they can be retrieved and
applied later. The function of weak methods is to provide learning
opportunities, not to solve problems.

Individual  weak  methods  were  formalized  in  the  late  fifties  and 
early  sixties  (Feigenbaum  &  Feldman,  1963),  but  the  general  category 
of  weak  methods  was  first  identified  by  Newell  (1969,  1980).  Laird 
(1986)  has  suggested  that  there  exists  a  universal  weak  method  from 
which  all  other  weak  methods  can  be  derived. 

The idea that learning is problem solving and that the function of
weak methods is to provide learning opportunities is implicit in the
concept of trial and error and thus traces its roots back to behaviorism.
Although first formalized in a computational model by Anzai and Simon
(1979), this idea is central to several recent models of learning (e.g.,
Anderson, 1986; Holland et al., 1986; Rosenbloom, 1986). In the field
of machine learning, the notion that learning occurs en route to an
answer rather than after completion of a practice problem has been
emphasized by Mostow and Bhatnagar (1987, 1990) in their work on
adaptive search.

Principle  2:  Learning  as  Adaptation 

Weak  methods  are  inefficient  because  they  are  general.  A  domain- 
specific  cognitive  skill  is  efficient  because  it  reflects  the  structure  of  the 
relevant task environment. Skill acquisition begins with maximally general
procedures (weak methods) and ends with domain-specific skills.
Learning  is  gradual  adaptation. 

The process of adaptation cannot continue indefinitely. The task
environment only contains so much structure and when all the structure
has been absorbed, the skill cannot get any more specific or better
adapted. In complex and irregular domains, expert strategies lie
between weak methods and algorithms in specificity. They guide behavior
without fully determining it and considerable uncertainty can remain
even at the highest level of expertise.

The idea that learning proceeds from the general to the specific is
counterintuitive, because it is common sense that learning begins with
the concrete and the specific and moves towards the general. The
common sense theory has little support in systematic research. Formal
analyses of induction (e.g., Angluin & Smith, 1983) have revealed that
many induction problems are NP-complete and that noisy input cripples
most induction algorithms. David Hume was right; induction does not
work. Knowledge must be constructed in some other way.
Specialization of pre-existing, general structures is one alternative. The
particular version of this idea in which learning proceeds from general
methods to task-specific methods was implicit in early computational
models (e.g., Anzai & Simon, 1979), but was to the best of my
knowledge first stated in two papers by Langley (1985) and by Anderson
(1987).

The idea that learning is adaptation to the environment can be
formulated in many different ways, as a comparison between Hull (1943),
Piaget (1971), and Anderson (1990) demonstrates. Until recently,
psychologists lacked a formal method for describing the learner's
environment independently of the learner. This threatened to make the
principle of adaptation circular, or at least difficult to apply. The
information processing approach is a major breakthrough because it
provides a formal description of task environments. Specifically, an
environment is described as a search space (or problem space; Newell &
Simon, 1972). The organism is then naturally described as a strategy for
traversing that space. Adaptation has a very definite meaning within this
formalization: A given strategy is adapted to a particular task
environment in inverse proportion to the amount of search required by
that strategy to find a path from the initial state to the goal state. A
maximally adapted strategy is one which leads to the goal without extra
or unnecessary steps.^1


^1 In an alternative approach, Anderson (1990) describes the environment in terms of
its statistical regularities. Many memory phenomena follow from the assumption




Principle 3: Learning as Conflict Resolution

Novices  make  many  errors;  that  is  why  we  call  them  novices. 
Experts  do  not;  that  is  why  we  call  them  experts.  The  weak  methods 
employed  by  novices  produce  errors  because  they  are  overly  general, 
causing  problem  solving  steps  to  be  performed  in  situations  in  which 
they are not appropriate. The task-specific skills of experts do not
generate errors because they constrain actions to situations in which they
are appropriate. The process of adapting a general method to a particular
task environment is a process of gradually eliminating errors. Error
elimination consists of two subprocesses: error detection and error
correction.

Error Detection. Learners can detect their errors in three ways:
by observing environmental effects, by self-monitoring, and by being
told by others (Reason, 1990, Chap. 6). Some task environments
provide direct feedback about errors. If the unknown device exploded when
the red button was pushed, pushing the red button was an error. Other
task environments do not provide feedback of this sort. In such
environments, learners can detect their errors by checking new conclusions
against their prior knowledge. Incomplete or incorrect procedural
knowledge is highly likely to generate conclusions or problem states that
contradict what the learner knows is true of the domain.

As an illustration, consider the following everyday situation: You
are driving to an unfamiliar location with the instruction to follow route
X north and make a right-hand turn onto Y-street. You are looking for
the turn and not finding it. Did you overshoot the turn or did you not go
far enough? The only way to decide whether you missed your turn is to
know some landmark (e.g., a bridge) which is further out on route X
than the turn onto Y-street. (A thoughtful friend includes such a
landmark in his or her instructions.) When you see the landmark, you know
that you missed your turn. The contradiction between the prior
knowledge that "Y-street is before the bridge" and the observation "here
is the bridge now" allows you to recognize that you have made a mistake.

^1 (continued) that memory is adapted to those regularities (Anderson & Schooler, 1991).
Anderson (1993) applies this approach to skill acquisition as well.

Technical skills often apply in symbolic task environments in which
contradictions between outcomes of problem solving steps and prior
knowledge constitute the only indicators of errors. Mathematical
symbols do not complain about being inserted into false equalities,
unsolvable equations, or incorrect calculations, so a good learner checks
his or her calculations. Checking, say, a subtraction by adding the
difference and the subtrahend requires the knowledge that the sum of the
difference and the subtrahend ought to equal the minuend. Structural
formulas for organic molecules do not beep when the laws of the covalent
bond are violated. Noticing an error in a structural formula requires the
knowledge that each bond ought to be associated with exactly two electrons,
that the total number of electrons cannot exceed the number of valence
electrons for the molecule, and so on. The more knowledge, the higher
the probability that the learner can detect his or her errors.
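The subtraction check mentioned above can be illustrated with a few lines of Python. This is only a sketch of the idea that prior knowledge serves as an error detector; the function name `subtraction_is_consistent` is hypothetical and the code is not part of the model described in this chapter.

```python
def subtraction_is_consistent(minuend: int, subtrahend: int, difference: int) -> bool:
    """Check a proposed difference against prior knowledge about subtraction.

    The knowledge applied is the principle that the difference plus the
    subtrahend ought to equal the minuend. The symbols themselves give no
    feedback, so the contradiction is the only indicator of an error.
    """
    return difference + subtrahend == minuend


# A correct answer passes the check; a faulty answer is detected.
correct_answer_ok = subtraction_is_consistent(503, 278, 225)
faulty_answer_ok = subtraction_is_consistent(503, 278, 335)
print(correct_answer_ok, faulty_answer_ok)
```

A learner who lacks the principle has no way of running such a check, which is the sense in which more knowledge raises the probability of detecting an error.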

Error Correction. The detection of a contradiction between a
new conclusion and prior knowledge leads to processes that aim to
restore consistency by revising the relevant procedural knowledge. If the
execution of action A in situation S1 leads to a new situation S2 which
violates some principle of the domain, then the mental decision
procedure that chose A in S1 is faulty. The obvious correction is to
constrain the procedure so as to avoid executing A in situations like S1.
This requires that the learner identify the conditions that caused the
error, i.e., those properties of S1 that guaranteed that the error would
occur if A were executed. Given knowledge of those conditions, the mental
procedure can be revised so as to avoid similar errors in the future.

The principle that learning is error correction superficially
resembles Thorndike's Law of Effect which says that actions with negative
consequences are gradually removed from the learner's behavioral
repertoire (while actions with positive consequences are strengthened).
However, the two principles are distinct, because a cognitive conflict is
not necessarily associated with a painful or unpleasant outcome, as the
examples given previously illustrate. The error correction principle is
also superficially related to the hypothesis that learning is driven by
impasses, i.e., situations in which existing procedural knowledge is
insufficient to decide what to do next (Newell, 1990; VanLehn, 1988).
However, impasses are not errors. An impasse is a situation in which
there is insufficient information to make a choice, while an error is a bad
choice.

The idea that cognitive change is triggered by contradictions and
inconsistencies has been suggested repeatedly in the cognitive sciences. It
is central to several recent cognitive models of learning. Holland et al.
(1986) put prediction-based evaluation of knowledge at the center of
learning: Knowledge is continuously applied in predicting events and
rules that lead to wrong predictions are modified. Schank (1982, 1986)
has proposed the similar idea that learning is triggered by expectation
failures. In developmental psychology, Piaget (1985) designated
cognitive conflict, which he called disequilibrium, as the driving force of
cognitive development. Empirical investigations support this hypothesis
(Murray, Ames, & Botvin, 1977). Social psychologists like Festinger
(1957) have proposed that cognitive dissonance causes individuals to
revise their beliefs in order to restore consistency (see Abelson et al.,
1968, for an overview of cognitive consistency theory). The hypothesis
that belief revision serves to maintain consistency has also been
proposed by philosophers (Quine & Ullian, 1978) and by science educators
(Hewson & Hewson, 1984; Posner et al., 1982).

Machine learning researchers have built systems that learn by
resolving conflicts (Hall, 1988; Kocabas, 1991; Rose & Langley, 1986)
and by explaining errors (Minton, 1988). The problem of what
constitutes a rational response to a contradiction has been studied in logic
and Artificial Intelligence under the rubric non-monotonic logic
(Gärdenfors, 1988; McDermott & Doyle, 1980). Finally, the idea that theory
development in science is driven by contradictions between theory and data
has been formulated in different ways by Duhem (1991/1914), Kuhn
(1970), and Popper (1972/1935). The relevance of these philosophers
for psychology is highlighted by Berkson and Wettersten's (1984)
attempt to recast Popper's philosophy as a learning theory. In short, the
idea of cognitive change as a response to conflict, contradiction, or
inconsistency has been proposed by so many researchers independently of
each other and in so many different fields that it deserves to be
recognized as one of the great unifying principles of the cognitive sciences.

Summary 

During practice the learner continuously monitors his or her
progress by comparing the current state of the practice problem to his or
her prior knowledge about the domain. A problem state that contradicts
something that is known to be true of the domain indicates that an error
has been made. When such a contradiction is noticed, the current
problem solving method is constrained so as to avoid making similar errors
in the future. As practice progresses, the general method becomes more
and more constrained and better and better adapted to the task
environment. Eventually it has become transformed into the correct
domain-specific skill and ceases to generate errors.

According to this theory, prior knowledge impacts skill acquisition
in two ways. First, knowledge allows the learner to detect his or her
errors. Facts and principles of the domain generate implications that an
incomplete or incorrect skill is likely to violate or contradict. The more
knowledge the learner has, the higher the probability that he or she will
be aware of the contradictions and conflicts generated by a faulty
solution or a mistaken problem solving step.

Second, prior knowledge allows the learner to identify the conditions
that caused the error. Finding the cause of an error might require
complicated  reasoning  about  the  domain.  The  more  knowledge  the 
learner  has,  the  higher  the  probability  that  he  or  she  accurately  identifies 
the  cause,  which  in  turn  is  a  prerequisite  for  successful  error  correction. 

In short, the theory put forth here claims that the function of
acquiring new skills through practice consists of three main subfunctions
(to generate task-relevant behavior, to identify errors, and to correct
errors), each of which, in turn, can be analyzed into subfunctions. The
functional analysis is summarized in Figure 1. Although the theory supports
qualitative arguments and explanations, the derivation of quantitative
behavioral predictions requires a working information processing
system.


I. Learn to do unfamiliar task

   A. Generate task-relevant actions

      1. Apply forward search

         a. Retrieve possible actions
         b. Select action
         c. Execute action

   B. Learn from erroneous actions

      1. Detect errors

         a. Check consistency between current problem
            state and prior knowledge after each action

      2. Correct error

         a. Extract information from error

            i. Identify the conditions under which a
               particular action is incorrect

         b. Revise current task procedure

            i. Constrain procedure so as to avoid that
               action under those conditions


Figure  1.  The  functional  analysis  of  learning  from  error. 
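The breakdown in Figure 1 can be made concrete with a toy sketch in Python. Everything in it is an assumption made for illustration: the task (saying the numbers 1 to 3 in order), the single piece of prior knowledge (a count starts at 1 and rises by one), the abstraction of a situation as "the last number said", and the names `practice`, `situation`, and `violates_prior_knowledge`. It is not the HS model; it only shows how generating actions, detecting errors, and constraining the procedure fit together in one loop.

```python
import random

ACTIONS = [1, 2, 3]  # the action "say n", for each n


def violates_prior_knowledge(state):
    """Prior knowledge as an error detector: a count starts at 1 and rises by one."""
    return any(n != i + 1 for i, n in enumerate(state))


def situation(state):
    """Assumed abstraction of a state: the last number said (None at the start)."""
    return state[-1] if state else None


def practice(trials=50, seed=0):
    """Return the number of errors on each trial and the learned restrictions."""
    rng = random.Random(seed)
    forbidden = set()  # learned restrictions, stored as (situation, action) pairs
    errors_per_trial = []
    for _ in range(trials):
        state, errors = [], 0
        while len(state) < len(ACTIONS):
            # A. Generate task-relevant actions; the procedure starts out
            #    maximally general and allows every action everywhere.
            options = [a for a in ACTIONS if (situation(state), a) not in forbidden]
            action = rng.choice(options)
            new_state = state + [action]
            # B.1. Detect errors by checking the new state against prior knowledge.
            if violates_prior_knowledge(new_state):
                errors += 1
                # B.2. Correct the error: constrain the procedure so that this
                #      action is avoided in situations like this one. The faulty
                #      step is simply undone (a simplification).
                forbidden.add((situation(state), action))
            else:
                state = new_state
        errors_per_trial.append(errors)
    return errors_per_trial, forbidden


errors_per_trial, forbidden = practice()
print(errors_per_trial[:10], len(forbidden))
```

Because every error adds a new restriction and the task contains only six wrong situation-action pairs, the total number of errors in this toy is bounded by six, after which the procedure generates no more errors. The real difficulty, which the sketch sidesteps, is identifying which properties of the situation caused the error.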


A  COMPUTATIONAL  MODEL 


To  move  from  a  functional  theory  to  a  working  model  one  must 
specify  particular  representations  and  processes  that  can  compute  the 
functions  described  in  the  theory.  In  particular,  an  implementation  of  the 
present theory requires (a) a performance mechanism, including a
representation for procedural knowledge, (b) a representation for declarative
knowledge,  (c)  a  mechanism  for  detecting  errors,  and  (d)  a  mechanism 
for  correcting  errors.  The  particular  model  described  here  is  called  the 
Heuristic  Searcher  (HS). 

A  Standard  Performance  Mechanism 

Memory  Architecture.  HS  has  three  memory  stores.  The 
working  memory  holds  the  model's  knowledge  state,  corresponding  to 
the  learner's  perception  of  the  current  state  of  the  practice  problem.  The 
procedural memory holds the model's procedural knowledge,
corresponding to the learner's previously acquired skills. The long-term
memory  holds  the  model's  declarative  knowledge,  corresponding  to  the 
learner's  prior  knowledge  about  the  domain.  There  is  no  separate  goal 
stack.  Goals  are  represented  in  working  memory. 

Procedural Knowledge. Procedural knowledge is represented
in so-called production rules (Newell & Simon, 1972), i.e., rules of the
general form


Goal,  Situation  -->  Action, 

where Goal is a description of what the learner believes he or she is
supposed to achieve in the practice problem, e.g., "construct the
structural formula for C2H5OH," and Situation is a description of a class of
situations, e.g., "situations in which the carbon skeleton of the
molecule has been completed but no other atoms have been connected
yet." Formally speaking, both Goal and Situation are patterns, i.e.,
conjunctions of elementary propositions which may or may not contain
(universally quantified) variables.
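To illustrate what it means for Goal and Situation to be patterns, the following Python sketch matches a conjunction of propositions with variables against a working memory of ground propositions. The encoding is an assumption made for this example: propositions are tuples, variables are strings beginning with "?", and the rule, the memory contents, and the function names are hypothetical. The matcher is deliberately naive and is not the matching algorithm used in the model.

```python
def is_var(term):
    """Variables are written as strings that start with a question mark."""
    return isinstance(term, str) and term.startswith("?")


def match_proposition(pattern, fact, bindings):
    """Match one pattern proposition against one ground fact.

    Returns the extended variable bindings, or None if the match fails.
    """
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)  # copy, so failed attempts leave no trace
    for p, f in zip(pattern, fact):
        if is_var(p):
            if bindings.setdefault(p, f) != f:
                return None  # variable already bound to something else
        elif p != f:
            return None
    return bindings


def match_pattern(pattern, memory, bindings=None):
    """Yield every binding under which all propositions in `pattern` hold."""
    bindings = {} if bindings is None else bindings
    if not pattern:
        yield bindings
        return
    first, rest = pattern[0], pattern[1:]
    for fact in memory:
        extended = match_proposition(first, fact, bindings)
        if extended is not None:
            yield from match_pattern(rest, memory, extended)


# Hypothetical working memory and rule for a structural formula problem.
memory = [
    ("goal", "build", "ethanol"),
    ("atom", "c1", "carbon"),
    ("atom", "c2", "carbon"),
    ("atom", "o1", "oxygen"),
    ("bond", "c1", "c2"),
]
rule = {
    "goal": [("goal", "build", "?m")],
    "situation": [("atom", "?o", "oxygen"), ("atom", "?c", "carbon")],
    "action": "connect ?o to ?c",
}
matches = list(match_pattern(rule["goal"] + rule["situation"], memory))
for b in matches:
    print(b)
```

The rule matches twice here, once for each carbon atom, which is the kind of situation in which more than one action is available in the same state.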


The action on the right-hand side of a production rule is a problem
solving step that the model knows how to perform, e.g., "connect the
oxygen atom to one of the carbon atoms". Actions have applicability
conditions that have to be satisfied before they can be applied. For
example, an oxygen atom cannot be attached to a carbon atom unless there
is a carbon atom for it to be attached to. Each action is implemented as a
piece of Lisp code that revises the current problem state by deleting
some propositions and adding others. Syntactically, the actions are
so-called STRIPS operators (Fikes & Nilsson, 1971). Psychologically, the
actions correspond to components of the practice problem which are
unproblematic for the learner.
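The idea of an action that deletes some propositions and adds others can be sketched as follows. The model's actions are written in Lisp; this Python version is only an analogy, restricted to ground operators without variables, and the class name, field names, and the example propositions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """A STRIPS-style operator over states represented as sets of propositions."""
    name: str
    preconditions: frozenset  # applicability conditions
    delete_list: frozenset    # propositions removed from the state
    add_list: frozenset       # propositions added to the state

    def applicable(self, state):
        return self.preconditions <= state

    def apply(self, state):
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable in this state")
        return (state - self.delete_list) | self.add_list


# Hypothetical example: attach the oxygen atom o1 to the carbon atom c2.
state = frozenset({("atom", "c2"), ("atom", "o1"), ("unattached", "o1")})
connect = Operator(
    name="connect-o1-c2",
    preconditions=frozenset({("atom", "c2"), ("unattached", "o1")}),
    delete_list=frozenset({("unattached", "o1")}),
    add_list=frozenset({("bond", "o1", "c2")}),
)
new_state = connect.apply(state)
print(sorted(new_state))
```

Note that the applicability conditions say only when the step can be carried out, not when it ought to be; deciding the latter is the job of the production rules.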

Each production rule is a single unit of procedural knowledge,
corresponding to a single problem solving heuristic. The skill required to
solve problems of a particular type, e.g., to construct structural
formulas in chemistry, consists of a collection of interrelated rules. All
production rules are stored in the single production memory, without
structural divisions between different skills.

Operating Cycle. The model solves problems by searching a
problem space. The content of the working memory at the time the
system is initialized is the initial state of the search space. The top goal
implicitly specifies the goal state. The ensemble of operators consists of
the set of actions the model has been given as input. In each cycle of
operation, the Goals and Situations of the rules are matched against the
working memory with a version of the RETE pattern matching
algorithm developed by Forgy (1982). If a rule matches, its action is
executed.

If more than one rule matches the current state, each matching rule
is evoked and one new descendant of the current state is generated for
each evoked rule. The entire search tree is saved in memory. Each cycle
begins with the selection of which search state to install as the current
state for that cycle. In some applications of HS, the selection of the
current state is based on a task specific evaluation function, in which case
the model performs best-first search. If the evaluation function has the
right properties and, in addition, the system checks for repeated
occurrences of the same state,^2 then the model executes the A* algorithm
(Pearl, 1984, p. 64). In the absence of any evaluation function, the state
to expand next is selected randomly among the immediate descendants
of the current state, in which case the model performs depth-first search.
In psychological terms, the performance mechanism corresponds to the
hypothesis that people respond to uncertainty by thinking through
alternative actions before deciding what to do next.
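The two selection regimes described above can be contrasted in a short Python sketch. It is a simplification under stated assumptions: the function `search`, its parameters, and the toy number problem are invented for this example, states stand in for whole working memories, and the check for repeated states is left switched on so that the toy search is guaranteed to terminate.

```python
import heapq
import itertools
import random


def search(initial, successors, is_goal, evaluate=None, seed=0, max_cycles=10_000):
    """Return (goal_state, cycles_used), or (None, cycles_used) on failure.

    With an evaluation function the most promising saved state is expanded
    next (best-first search). Without one, the next state is taken at random
    from the immediate descendants of the current state (depth-first search).
    """
    rng = random.Random(seed)
    counter = itertools.count()  # tie-breaker, so states are never compared
    best_first = evaluate is not None
    frontier = [(evaluate(initial), next(counter), initial)] if best_first else [initial]
    seen = {initial}
    for cycle in range(1, max_cycles + 1):
        if not frontier:
            return None, cycle - 1
        # Select which saved state to install as the current state.
        current = heapq.heappop(frontier)[2] if best_first else frontier.pop()
        if is_goal(current):
            return current, cycle
        # Generate one descendant per applicable action, skipping repeats.
        children = []
        for child in successors(current):
            if child not in seen:
                seen.add(child)
                children.append(child)
        if best_first:
            for child in children:
                heapq.heappush(frontier, (evaluate(child), next(counter), child))
        else:
            rng.shuffle(children)      # random choice among descendants
            frontier.extend(children)  # stack discipline gives depth-first order
    return None, max_cycles


# Toy problem: reach 10 from 1 with the actions "add one" and "double".
def successors(n):
    return [m for m in (n + 1, n * 2) if m <= 10]


best = search(1, successors, lambda n: n == 10, evaluate=lambda n: abs(10 - n))
blind = search(1, successors, lambda n: n == 10)
print(best, blind)
```

The number of cycles each regime needs gives a rough handle on how much search a strategy requires, which is the quantity that learning is supposed to reduce.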

A  Representation  for  Declarative  Knowledge 

The function of procedural knowledge is to control action. The
function of declarative knowledge is not equally obvious. Philosophical
discussions often assume that the function of declarative knowledge is
to provide descriptions of the world ("the cat is on the mat"), predictions
about future events ("the sun will rise tomorrow"), or explanations ("it
is snowing, because the temperature fell"). The epistemological, logical,
and semantic riddles associated with these functions have exercised
thinkers in a variety of disciplines for centuries.

The HS model is based on a different view of the nature and
function of declarative knowledge. Declarative knowledge is not used
to describe, predict, or explain but to circumscribe a set of states of the
world. The unit of declarative knowledge is a constraint. Constraints
can be interpreted descriptively, i.e., as circumscribing the set of
possible states of the world. For example, the law of conservation of mass
claims that the mass of the reactants in a chemical experiment is equal to
the mass of the reaction products. Mass is neither created nor destroyed
in a chemical reaction, so the mass of the inputs is always equal to the
mass of the outputs. The point of the mass conservation law is that it
circumscribes situations in which mass is conserved, which are possible,
separates them from situations in which mass is not conserved, and
rules out the latter as impossible. Figure 2 shows the constraint
interpretation of the mass conservation law.

Constraints are not limited to representing abstract principles like
the law of conservation. Particular facts are also constraints. For
example, the fact that alcohol molecules have an OH-group corresponds to
the constraint that a structural formula for an alcohol had better have
an OH-group somewhere. Figure 3 shows the constraint interpretation of
this fact.


^This ability is computationally expensive and is usually switched off.


Example 1: A scientific principle

Idiomatic English:       Energy cannot be created or destroyed.

Constraint formulation:  If the mass of the reactants for a chemical
                         experiment is M1 and the mass of the products
                         is M2, then M1 must be equal to M2.

Formal representation:   (Reactants R) (Mass R M1)
                         (Products P) (Mass P M2)
                         ** (Equal M1 M2)

Figure 2. Encoding a scientific principle as a constraint.

Constraints can also be interpreted prescriptively, i.e., as
circumscribing the set of desired states of the world. The ordinance
that one should not drive along a one-way street in the wrong direction
is a constraint. Specifically, the fact that Fifth Avenue is one-way in
the westerly direction corresponds to the constraint that if you are
driving on Fifth Avenue, you had better be heading west. It is not
impossible to head east; it is merely undesirable. Figure 4 shows the
constraint interpretation of this ordinance.

It is a mistake to try to classify individual constraints as either
descriptive or prescriptive. All constraints can be interpreted in both
ways, because the two interpretations determine each other. It is
desirable that a chemistry experiment satisfies the constraint that the
mass of the reactants is equal to the mass of the reaction products. If
this is not the case, then some error was committed in the execution of
the laboratory procedure, i.e., some mass was accidentally lost or the
experiment was contaminated in some way (Gensler, 1987). The constraint
expressed in the mass conservation law acquires a prescriptive function
because it can be interpreted descriptively; a laboratory procedure
ought to conform to it precisely because it is true. The descriptive and
prescriptive aspects of constraints are inseparable.


Example 2: A scientific fact

Idiomatic English:       Every alcohol molecule has an OH-group.

Constraint formulation:  If X is an alcohol molecule, then it must have
                         an OH-group.

Formal representation:   (Isa X molecule)
                         (Substance X ALCOHOL)
                         ** (Isa Y OH-GROUP)
                         (Part-of Y X)

Figure 3. Encoding a scientific fact as a constraint.

The main contribution of the HS model is a formal representation
for constraints and a set of processes for using them. A constraint C is
represented as an ordered pair

    $\langle C_r, C_s \rangle$

where $C_r$ is a relevance criterion, i.e., a specification of the
circumstances under which the constraint applies, and $C_s$ is a
satisfaction criterion, i.e., a condition that has to be met for the
constraint to be satisfied. To continue the traffic example, if Fifth
Avenue is one-way in the westerly direction, then "driving on Fifth
Avenue" is the relevance criterion and "is heading west" is the
satisfaction criterion. If I am not on Fifth Avenue, the direction of my
travel is not constrained by this ordinance, but when I am on Fifth,
then I had better be driving west rather than east. In the mass
conservation example, "M1 is the mass before the reaction and M2 is the
mass after the reaction" is the relevance criterion, while the equality
"M1 = M2" is the satisfaction criterion.


Example 3: An everyday fact

Idiomatic English:       Fifth Avenue is a one-way street heading west.

Constraint formulation:  If someone is driving on Fifth Avenue, then he
                         or she ought to travel westwards.

Formal representation:   (State X DRIVING)
                         (Location X FIFTH-AVENUE)
                         ** (Direction X WEST)

Figure 4. Encoding an everyday fact as a constraint.

The double star connective (**) that appears in Figures 2-4 is not a
symbol for logical implication. Constraints are not inference rules;
they do not generate conclusions. Nor are they production rules; they do
not fire operators. The semantics of the double star connective is
similar to the meaning of "ought to", "had better", and related phrases.
The interpretation of a constraint $\langle C_r, C_s \rangle$ is that
whenever $C_r$ is the case, $C_s$ ought to be the case as well (or else
something has gone awry). Syntactically, both $C_r$ and $C_s$ are
patterns, i.e., conjunctions of propositions similar to the condition
side of a production rule.


The HS model does not have any mechanism for acquiring or revising
its declarative knowledge. The constraints are input by the user and
they stay unchanged throughout a simulation run. The purpose of the
constraints is to facilitate the detection and correction of errors.

A  Mechanism  for  Error  Detection 

At the beginning of each operating cycle, all production rules are
matched against working memory, the rules with matching condition
sides are evoked, the actions of those rules are executed, and new
problem states are thus generated. Each new state is matched against all
the available constraints. (The match is computed with the same pattern
matcher that matches the production rules.) Constraints with
non-matching relevance patterns do not warrant any action on the part of
the system, because they are irrelevant. Constraints that have matching
relevance patterns and also matching satisfaction patterns are ignored
as well. The new state is consistent with those constraints, so no
action is required. On the other hand, if a constraint with a matching
relevance pattern has a non-matching satisfaction pattern, then the new
state violates that constraint and some response or action is called
for. Such a constraint violation signals that something is wrong with
the procedure that generated the current state; an error has been
committed.
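
The three-way classification described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: propositions are treated as ground strings, so there is no variable binding of the kind the HS pattern matcher performs, and the names `Constraint`, `classify`, and `violations` are hypothetical rather than taken from the HS implementation.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Constraint(NamedTuple):
    name: str
    relevance: FrozenSet[str]     # C_r: when the constraint applies
    satisfaction: FrozenSet[str]  # C_s: what must then also hold


def classify(state: FrozenSet[str], constraint: Constraint) -> str:
    """Return 'irrelevant', 'satisfied', or 'violated' for one state."""
    if not constraint.relevance <= state:
        return "irrelevant"       # relevance pattern does not match
    if constraint.satisfaction <= state:
        return "satisfied"        # both patterns match: nothing to do
    return "violated"             # relevant but not satisfied: an error


def violations(state: FrozenSet[str],
               constraints: Iterable[Constraint]) -> List[str]:
    """Names of all constraints that the given state violates."""
    return [c.name for c in constraints if classify(state, c) == "violated"]


# The Fifth Avenue constraint of Figure 4, with X treated as a constant.
one_way = Constraint(
    name="fifth-avenue-one-way",
    relevance=frozenset({"(State X DRIVING)", "(Location X FIFTH-AVENUE)"}),
    satisfaction=frozenset({"(Direction X WEST)"}),
)

driving = {"(State X DRIVING)", "(Location X FIFTH-AVENUE)"}
heading_east = frozenset(driving | {"(Direction X EAST)"})
heading_west = frozenset(driving | {"(Direction X WEST)"})
parked = frozenset({"(State X PARKED)", "(Location X FIFTH-AVENUE)"})
```

Only the `heading_east` state counts as an error here; the other two states call for no action, for the two different reasons given in the text.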

Specifically, consider a rule R with goal G and a conjunction S of
situation features in its left-hand side and a single action A in its
right-hand side,

    $R: G, S \rightarrow A,$

and a constraint C with relevance pattern $C_r$ and satisfaction
pattern $C_s$,

    $C = \langle C_r, C_s \rangle,$

where both $C_r$ and $C_s$ are conjunctions of situation features. In
particular, let us assume that $C_r$ and $C_s$ each consists of two
features:

    $C_r = C_r' \,\&\, C_r''$

and

    $C_s = C_s' \,\&\, C_s''.$

Finally, let us assume that the effect of action A is to add the
conjunction of $C_r''$ and $C_s'$ to the current problem state, i.e.,

    $A = \mathrm{Add}[C_r'' \,\&\, C_s'].$

If a learner with rule R and constraint C encounters a problem state
$S_1$ described by

    $S \,\&\, C_r',$

then the left-hand side of R is satisfied because S is present, so the
rule will be evoked and action A executed. The effect is that $C_r''$
and $C_s'$ are added to $S_1$, yielding a new problem state $S_2$
described by

    $S \,\&\, C_r' \,\&\, C_r'' \,\&\, C_s'.$

In this problem state, both $C_r'$ and $C_r''$ are present, so $C_r$
matches, i.e., the constraint is relevant. Although $C_s'$ is present,
$C_s''$ is not, so $C_s$ is violated; hence, doing A in situation $S_1$
was an error.
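
The abstract example can be traced step by step with Python sets. This is a minimal sketch, assuming that features are opaque strings and that the goal G can be left out because the state descriptions in the example do not mention it; none of the names below come from the HS implementation.

```python
# Rule R: if S holds, do A, where A = Add[Cr'' & Cs'].
RULE_CONDITION = frozenset({"S"})
RULE_ADDS = frozenset({"Cr''", "Cs'"})

# Constraint C = <Cr, Cs>, each pattern a conjunction of two features.
RELEVANCE = frozenset({"Cr'", "Cr''"})
SATISFACTION = frozenset({"Cs'", "Cs''"})

# Problem state S1 is described by S & Cr'.
s1 = frozenset({"S", "Cr'"})

# The left-hand side of R matches S1, so A is executed and yields S2.
rule_fires = RULE_CONDITION <= s1
s2 = s1 | RULE_ADDS if rule_fires else s1

# In S2 the relevance pattern matches but the satisfaction pattern does not.
relevant = RELEVANCE <= s2
satisfied = SATISFACTION <= s2
violated = relevant and not satisfied
```

After the rule fires, `s2` contains S, Cr', Cr'', and Cs', so the constraint is relevant but Cs'' is missing, which is the violation described in the text.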

In principle, there are two possible interpretations of the constraint
violation: The fault might lie either with the procedural knowledge (the
rule) or with the declarative knowledge (the constraint). Because HS
was designed to model skill acquisition, as opposed to the acquisition
of declarative knowledge, it assumes that the rule rather than the
constraint is at fault.


A  Mechanism  for  Error  Correction 


A constraint violation is a signal that the procedural knowledge that
generated the current problem state is faulty and needs to be revised.
HS assumes that the fault lies with the last rule to fire. The problem
of how to learn from the constraint violation can be stated as follows:
Given that rule R,

    $R: G, S \rightarrow A,$

was applied to state $S_1$ and that it generated state $S_2$ and that
$S_2$ violates constraint C, how should the rule be revised? The purpose
of the revision is to avoid similar constraint violations in the future.
The learning mechanism in the HS model accomplishes this by finding the
cause of the constraint violation, i.e., the properties of state $S_1$
that were responsible for the error, and revising rule R so that it does
not apply under those conditions. The learning mechanism finds the
relevant properties of $S_1$ by regressing the violated constraint
through the rule with a variant of the standard regression algorithm
used in many A.I. systems (Nilsson, 1980, p. 288).

More specifically, rule R is replaced with two new rules R' and
R'', representing two different revisions of R. The purpose of the first
revision is to constrain R so that the new rule will apply only in
situations in which constraint C is guaranteed to remain irrelevant.
This is accomplished by regressing the relevance pattern $C_r$ through
the rule. Continuing the example from the previous subsection,
regressing the relevance pattern $C_r = (C_r' \,\&\, C_r'')$ through the
operator $A = \mathrm{Add}[C_r'' \,\&\, C_s']$ yields $C_r'$ as the only
output (see Nilsson, 1980, p. 288, for an explanation of the regression
algorithm). The first new rule is constructed by adding the negation of
the output from the regression to the original rule:

    $R': S \,\&\, \neg C_r' \rightarrow A$

This rule applies only in those situations in which the constraint is
guaranteed to remain irrelevant if action A is executed.
Psychologically, the rule corresponds to the knowledge that one should
only do A when S is true but $C_r'$ is false (e.g., "if the device needs
repair and the power is not on, then open the front panel").

The purpose of the second revision is to constrain rule R so that it
applies only in situations in which the constraint C is guaranteed to
become both relevant and satisfied if A is executed. This is
accomplished by regressing the entire constraint through the rule,
instead of the relevance pattern. Regressing
$(C_r' \,\&\, C_r'' \,\&\, C_s' \,\&\, C_s'')$ through the operator
$A = \mathrm{Add}[C_r'' \,\&\, C_s']$ yields $(C_r' \,\&\, C_s'')$ as
the output (see Nilsson, 1980, p. 288). The second new rule is
constructed by adding this result to the original rule (without negating
it):

    $R'': S \,\&\, C_r' \,\&\, C_s'' \rightarrow A$

This rule applies only in those situations in which the constraint is
guaranteed to become satisfied if A is executed. Psychologically, the
rule corresponds to the knowledge that one should only do A when S,
$C_r'$, and $C_s''$ are all true (e.g., "if the device needs repair, the
power is on, and the red light is blinking, then switch off the power").

Figure 5 provides a graphical interpretation of the learning
mechanism. The set S of situations in which the original rule R applies
is split into three subsets when the rule is revised. The first subset
contains those situations in which the constraint is guaranteed to
remain irrelevant if action A is executed. They are covered by the first
new rule. The second subset contains those situations in which the
constraint is guaranteed to become satisfied if A is executed. They are
covered by the second new rule. The third subset contains those
situations in which doing A leads to a constraint violation. They are
thrown away, as it were. Neither of the two new rules applies in those
situations, so the error type represented by the third subset has been
eliminated.
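
The two revisions and the resulting three-way split can be sketched as follows. This is a simplified illustration, not the HS algorithm: it assumes add-only operators and variable-free features, in which case regressing a positive conjunction through Add[F] simply removes the features in F. The full regression algorithm (Nilsson, 1980) also deals with deletions and variable bindings. The names `Rule`, `regress`, `applies`, and `revise` are hypothetical.

```python
from typing import FrozenSet, NamedTuple, Tuple

Features = FrozenSet[str]


class Rule(NamedTuple):
    name: str
    required: Features               # features that must be present
    forbidden: Tuple[Features, ...]  # conjunctions that must not all hold
    adds: Features                   # effect of the action: Add[...]


def regress(pattern: Features, adds: Features) -> Features:
    """What must hold before Add[adds] for `pattern` to hold afterwards."""
    return pattern - adds


def applies(rule: Rule, state: Features) -> bool:
    blocked = any(conj <= state for conj in rule.forbidden)
    return rule.required <= state and not blocked


def revise(rule: Rule, c_r: Features, c_s: Features) -> Tuple[Rule, Rule]:
    """Replace a faulty rule by R' (stays irrelevant) and R'' (satisfied)."""
    pre_relevant = regress(c_r, rule.adds)
    pre_satisfied = regress(c_r | c_s, rule.adds)
    r1 = Rule(rule.name + "'", rule.required,
              rule.forbidden + (pre_relevant,), rule.adds)
    r2 = Rule(rule.name + "''", rule.required | pre_satisfied,
              rule.forbidden, rule.adds)
    return r1, r2


# The running example: A = Add[Cr'' & Cs'].
R = Rule("R", frozenset({"S"}), (), frozenset({"Cr''", "Cs'"}))
C_R = frozenset({"Cr'", "Cr''"})
C_S = frozenset({"Cs'", "Cs''"})
R1, R2 = revise(R, C_R, C_S)

irrelevant_case = frozenset({"S"})                 # first subset
satisfied_case = frozenset({"S", "Cr'", "Cs''"})   # second subset
error_case = frozenset({"S", "Cr'"})               # third subset
```

In this sketch `R1` requires S and the absence of Cr', `R2` requires S, Cr', and Cs'', and neither applies in `error_case`, which corresponds to the subset that is "thrown away".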

The fact that one type of error has been eliminated does not imply
that the two new rules R' and R'' are correct. Although the new rules
have been revised so as to be consistent with one constraint, they might
still violate other constraints and so have to be revised further.
Repeated revision of rules is the standard case in HS learning. Also,
the fact that one rule has been revised does not imply that other rules
are correct. Learning proceeds by gradual correction of the relevant
rule set as a function of the errors that the model encounters during
practice. A detailed analysis of the correction of an entire rule set is
available in Ohlsson and Rees (1991a, Table 5).

[Figure 5. Graphic not reproduced; surviving label: "Set of situations
in which initial rule applies."]

Discussion 

The HS model is based on two representational assumptions: that
procedural knowledge is represented in production rules and that
declarative knowledge is represented in constraints. The production
system format was proposed by Newell and Simon (1972) and has since been
taken up by other researchers (Klahr, Langley, & Neches, 1987). The
main claim of the production system hypothesis is that human action is
determined by an external context, represented by the situation the
learner is faced with, and an internal context, represented by the
learner's goal. Procedural knowledge consists of associations between
goals, situations, and actions. The individual production rule is the
smallest unit of procedural knowledge; it maps a single goal/situation
pair onto a particular action.

A second claim of the production system hypothesis is that the units
of procedural knowledge are modular. Production rules do not access or
operate upon each other. They only interact through their effects on
working memory. There is strong empirical evidence for the modularity
of procedural knowledge (Anderson, 1993).

The constraint format originated with the current theoretical effort
(Ohlsson & Rees, 1991a) and it does not have any empirical or
theoretical support other than the success of the model it is embedded
in. There has been so little progress on the epistemological, logical,
and semantic problems associated with the standard, propositional
interpretation of declarative knowledge that any alternative conception
is worth exploring.

Given the two representational assumptions, information processing
mechanisms that compute the functions specified in the abstract theory
(see Figure 1) can be specified. In HS, the function of generating
task relevant activity is carried out by forward search, the function of
detecting errors is carried out by a pattern matcher, and the function
of correcting errors is carried out by a rule revision algorithm based
on regression. There are alternative ways to compute each of these
functions. HS could have been implemented with, for example, analogical
transfer instead of heuristic search as the weak method responsible for
generating task relevant behavior. Similar substitutions of alternative
mechanisms are possible for each of the other functions specified in the
theory. The predictions generated by running the model are consequences
of both the theoretical principles that guided its design and the
particular representations and processes that are implemented in it.

Compared to many other machine learning systems, HS is very
simple. It combines a standard production system architecture, a
well-known weak method, and an off-the-shelf regression algorithm;
little else is needed. HS is implemented in Lucid Common Lisp and runs
on a Sun Sparcstation 1+ with 16 megabytes of main memory. The core
mechanisms have been debugged in hundreds of simulation runs in
different domains over a period of four years and are very robust.


APPLICATIONS  TO  CLASSICAL  RESEARCH  PROBLEMS 

A  good  theory  should  throw  new  light  on  the  perennial  problems  of 
the  discipline.  The  learning  curve  and  transfer  of  training  have  been 
central  problems  in  the  theory  of  learning  for  a  long  time. 

The  Learning  Curve 

Background. If performance level, measured in terms of time to
complete a practice problem, is plotted as a function of amount of
practice, measured in terms of the number of practice problems solved,
i.e., the number of trials, the result is a negatively accelerated
curve. The rate of improvement is fastest at the beginning of practice
and quickly slows down as mastery is approached. This type of learning
curve has been observed in a large number of studies, across many
different tasks, and in widely varying subject populations (Lane, 1987;
Mazur & Hastie, 1978; Newell & Rosenbloom, 1981; Ohlsson, 1992c).

Armchair reasoning would lead one to expect learning to be slow in
the beginning, when the learner is still groping to understand the
practice task and there is little relevant knowledge or skill to build
on. Later in the practice sequence, the partial knowledge built up
during previous trials serves as a lever for acquiring more knowledge,
with increased speed of learning as a result. However, research leaves
no doubt that the opposite is the case: The rate of skill acquisition is
faster the less the learner knows about the task. No theory of practice
is adequate unless it can explain this unexpected finding.

The hypothesis that skill acquisition is the elimination of errors
provides such an explanation. According to this hypothesis, knowledge
is revised when the learner becomes aware of an error. Learning is thus
a sequence of learning events, with one error (type) being eliminated
per event. The prediction of a negatively accelerated learning curve
follows from this hypothesis in three easy steps:

1. The consequence of an error is floundering, i.e., unnecessary
search. Performance improves when the error is corrected because
the unnecessary search is eliminated. Let us assume that the
amount of unnecessary search caused by an error is approximately
constant across errors. Performance then improves by a constant
amount per learning event.

2. At the outset the learner makes many errors on each practice
problem precisely because he or she knows so little about the
task. As mastery is approached, the number of mistakes per
problem decreases because many errors have already been
eliminated. There are fewer and fewer learning events per trial as
practice progresses.

3. Constant improvement per learning event and decreasing number
of learning events per trial imply a decreasing rate of
improvement per trial.

This explanation does not depend on the details of particular
information mechanisms. Any theory or model which claims that learning
events are triggered by trouble situations (defined as cognitive
conflicts, contradictions, errors, expectation failures, impasses,
wrong answers, or in any other way) implies this explanation, because
trouble situations disappear as mastery is approached, by definition of
"mastery."
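
The three-step argument can be illustrated numerically. The sketch below is a toy model and not the HS simulation: it assumes a fixed pool of error types, assumes that each remaining error is encountered (and then eliminated) on a given trial with the same probability, and works with expected values. All parameter names and values are made up for illustration. Under these particular assumptions the curve is exponential rather than a power law, so the sketch only demonstrates negative acceleration, not the specific shape discussed next.

```python
from typing import List


def expected_curve(n_errors: int = 20,
                   p_encounter: float = 0.3,
                   base_time: float = 10.0,
                   cost_per_error: float = 5.0,
                   n_trials: int = 15) -> List[float]:
    """Expected time per trial when each error costs a constant amount."""
    times = []
    remaining = float(n_errors)  # expected number of uncorrected errors
    for _ in range(n_trials):
        # Expected learning events on this trial: errors made and corrected.
        events = remaining * p_encounter
        # Step 1: each error adds a constant amount of unnecessary search.
        times.append(base_time + cost_per_error * events)
        # Step 2: corrected errors are gone, so later trials have fewer events.
        remaining -= events
    return times


times = expected_curve()
# Step 3: the improvement from one trial to the next keeps shrinking.
gains = [earlier - later for earlier, later in zip(times, times[1:])]
```

Plotting `times` against trial number under these assumptions gives a curve that drops quickly at first and then flattens, as the qualitative argument predicts.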

The qualitative argument explains why we should expect the rate of
improvement to slow down across trials, but it does not make a specific
prediction about the shape of the learning curve. Newell and
Rosenbloom (1981) have reviewed the evidence that the human learning
curve is a member of the class of curves described by so-called power
laws, i.e., by equations of the general form

    $T = A + kP^{-r}$

where T is the time to complete the current practice problem, A is the
asymptotic performance, P is the amount of practice in trials, and k and
r are constants.
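
One common way to check whether data follow such a law is to fit a straight line in log-log coordinates, since $\log(T - A) = \log k - r \log P$. The sketch below uses ordinary least squares from the Python standard library; it assumes that the asymptote A is known or negligible, which is a simplification, and the function name `fit_power_law` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(trials: Sequence[float],
                  times: Sequence[float],
                  asymptote: float = 0.0) -> Tuple[float, float]:
    """Estimate k and r in T = A + k * P**(-r) for a given asymptote A."""
    xs = [math.log(p) for p in trials]
    ys = [math.log(t - asymptote) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                   # equals -r on log-log axes
    intercept = mean_y - slope * mean_x  # equals log k
    return math.exp(intercept), -slope


# Synthetic data generated from a known power law (A=2, k=30, r=0.5).
trial_numbers = list(range(1, 10))
synthetic_times = [2.0 + 30.0 * p ** -0.5 for p in trial_numbers]
k_hat, r_hat = fit_power_law(trial_numbers, synthetic_times, asymptote=2.0)
```

On noise-free synthetic data the fit recovers the generating constants; on real or simulated learning data, the quality of the straight-line fit is what indicates whether a power law is a reasonable description.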

Simulating the Learning Curve. To derive the learning curve
predicted by the present theory, a simulation experiment was run with
the HS model. A problem solving skill from the domain of chemistry
was chosen as the target for the simulation. Chemists frequently need to
know the interconnections between the atoms in a molecule. The
interconnections are specified in structural formulas, so-called Lewis
structures. A Lewis structure shows which atoms in a molecule are bound
to which other atoms and by which kind of bond. The task of constructing
the Lewis structure for a particular molecule, specified through its
molecular (sum) formula, will here be called a Lewis problem. Figure 6
shows the initial state and the goal state of a Lewis problem. There is
usually more than one path to the goal state. Figure 7 shows one such
path. The cognitive skill of solving Lewis problems is taught in the
beginning of college level courses in organic chemistry (e.g.,
Solomons, 1988).

The HS model was given a representation for atoms, molecules,
valencies, bonds between atoms, and the other entities, properties and
relations that are important in the chemistry environment. The actions
involved in Lewis problems are to select atoms, to connect atoms, to
make double bonds, and so on. Figure 8 summarizes the problem space
for Lewis problems.


Initial state: A sum formula

    CH3CH2OH

Goal state: A Lewis structure

        H   H
        |   |
    H - C - C - O - H
        |   |
        H   H

Figure 6. Initial state and goal state for a Lewis problem.

In order to attempt to solve practice problems, HS must be given an
initial procedure. In this application, the model was given a set of
very general initial rules that encode a procedure for how to construct
Lewis structures that approximates the verbal recipes given in chemistry
textbooks (e.g., Solomons, 1988, pp. 10-11; Sorum & Boikess, 1981,
pp. 104-107). Finally, in order to detect and correct its errors, the
model must have some prior knowledge about the domain. It was given a
set of constraints that encode some relevant facts about the chemistry
of alcohols, ethers, and pure hydrocarbons.

Nine molecules (three alcohols, three ethers, and three
hydrocarbons) were selected as practice problems. The model solved each
of the nine problems, presented in random order. This corresponds to the
simulation of a single subject going through a sequence of nine
different practice problems. The model was then re-initialized and run
through the nine problems once again, simulating a second subject. All
in all, the model worked through the nine practice problems 357 times,
each time in a different random order, thus simulating a learning
experiment with that number of subjects. Figure 9 summarizes the initial
knowledge, the training procedure, and the outcome of the chemistry
simulation.


1. Connect the carbons:         C - C

2. Attach the oxygen:           C - C - O

3. Complete the OH-group:       C - C - O - H

                                    H   H
                                    |   |
4. Distribute the hydrogens:    H - C - C - O - H
                                    |   |
                                    H   H

                                    H   H
                                    |   |
5. Add electron pairs:          H - C - C - O - H
                                    |   |
                                    H   H

Figure 7. A solution path for the Lewis problem in Figure 6.


Representation

Symbols that represent atoms, electron pairs, molecules, noble gas
configurations, numbers, single, double and triple bonds, substances,
types of carbon arrangements (branched structures, chains, and rings),
two-dimensional spatial relations, and valencies.

Initial state

A molecular (sum) formula.

Operators

Select an atom, place the first atom, attach an atom to the molecule,
identify open bonds, create multiple bonds, and add electron pairs.

Goal state

A correct Lewis structure for the given molecule. A Lewis structure
must (a) connect all the atoms in the sum formula, (b) not include
any other atoms than those in the sum formula, (c) have a number of
valence electrons equal to the sum of the valence electrons of the
atoms, and (d) give each atom a noble gas configuration.

Figure 8. A problem space for Lewis problems.

Prior procedural knowledge

The model began with a procedure that connects the heavy atoms,
adding multiple bonds if needed, connects the hydrogens, and then
adds the final electron pairs. This procedure generates correct Lewis
structures, but requires large amounts of search.

Prior declarative knowledge

There were 16 constraints which encode knowledge about (a) properties
of particular classes of molecules, e.g., that alcohols have a
C-O-H group and that ethers have a C-O-C group, (b) spatial properties
of the possible carbon skeletons (branched structures, chains, and
rings), and (c) the distribution of hydrogens across the molecule.

Training

The model was given unsupervised practice on a mixed set of Lewis
problems that included alcohols, ethers, and pure hydrocarbons.

Learning outcome

The model learned a set of rules for constructing Lewis structures
for the relevant molecules with a minimal amount of search.

Figure 9. Summary of the chemistry simulation.


The data from the simulation runs were aggregated by averaging the
performance of all 357 simulated subjects for each trial. The average
performance on each trial was plotted as a function of trial number.
(This corresponds to how learning curves are constructed from
psychological data.) Figure 10 shows the results. Performance as a
function of practice approximates a straight line when plotted with
logarithmic coordinates on both axes, the hallmark of a curve described
by a power law. The HS model thus predicts that improvement over time
follows the particular shape that has been observed in data from human
learning.
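
The aggregation step can be sketched as follows. This is only an outline of the bookkeeping, assuming that each simulated subject yields one performance score per trial; it does not reproduce the HS model itself, and the function names and the toy scores are hypothetical. Note that random sampling does not by itself guarantee that every subject receives a distinct order.

```python
import random
from statistics import mean
from typing import List, Sequence


def practice_orders(problems: Sequence[str],
                    n_subjects: int,
                    seed: int = 0) -> List[List[str]]:
    """One randomly shuffled problem sequence per simulated subject."""
    rng = random.Random(seed)
    return [rng.sample(list(problems), len(problems))
            for _ in range(n_subjects)]


def average_by_trial(scores_per_subject: Sequence[Sequence[float]]) -> List[float]:
    """Average the scores of all subjects at each trial position."""
    return [mean(trial_scores) for trial_scores in zip(*scores_per_subject)]


# Placeholder problem labels standing in for the nine practice molecules.
problems = [f"problem-{i}" for i in range(1, 10)]
orders = practice_orders(problems, n_subjects=357)

# Toy scores for two subjects over three trials (not simulation output).
toy_scores = [[8.0, 4.0, 2.0], [6.0, 4.0, 4.0]]
toy_curve = average_by_trial(toy_scores)
```

The averaged sequence is what would then be plotted against trial number, in logarithmic coordinates, to inspect the shape of the curve.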

Figure 10. Performance as a function of trials. [Plot not reproduced;
vertical axis: log of errors.]

The qualitative argument for why learning from error predicts a
negatively accelerated learning curve is based on the simplifying
assumption that there is a constant improvement per learning event. How
realistic is this assumption? The assumption is true in approximately
uniform task environments. By approximately uniform I mean that the
average branching factor in a small neighborhood around a search state
is equal for all states in the search space. If this is true and if
performance is plotted as a function of learning events instead of as a
function of trials, then the result should be a linear relationship with
negative slope. Figure 11 shows the results from a simulation run in
which HS was given repeated practice on a particular Lewis problem. When
performance is plotted as a function of learning events, the result
approximates a negative linear relationship, indicating that the
chemistry environment is, in fact, approximately uniform. An empirical
test of the prediction that human learning is linear in the number of
learning events (in this task environment) is possible in principle but
requires a method for identifying learning events in human data.


Figure 11. Performance as a function of learning events. [Plot not
reproduced; axes: errors against learning events.]


Transfer  of  Training 

Background. Knowledge must be applicable in other situations
than the one in which it was learned in order to be useful, but many
laboratory studies have recorded little or no transfer of procedural
knowledge even between isomorphic problems (Cormier & Hagman, 1987;
Singley & Anderson, 1989, Chap. 1). Although many models of learning
try to elucidate the mechanism of transfer (Ohlsson, 1987a; Singley
& Anderson, 1989), the empirical data imply that the main task for a
transfer model is to elucidate why transfer of training does not occur.
In spite of the negative findings, psychologists keep trying to identify
conditions that produce transfer, presumably because the findings
strongly contradict our experience of ourselves as creatures with
general and flexible competence. A second task for a theory of skill
acquisition is to resolve this apparent contradiction between the
laboratory findings and our intuitive self-understanding.

The production system hypothesis solves the first of these
explanatory tasks. If procedural knowledge is encoded in production
rules and if the rules required to solve a training task A are different
from the rules required to solve a target task B, then practice on A
will not affect the amount of learning required to master B, which is
the typical laboratory result. Production rules are task specific, so
they do not transfer.

From  the  point  of  view  of  common  sense,  the  lack  of  transfer  of 
training  between  isomorphic  problems  is  particularly  puzzling.  A  series 
of  experiments  with  different  versions  of  Duncker's  ray  problem  has 
shown  that  unless  subjects  are  explicitly  reminded  of  the  training  task, 
transfer  to  an  isomorphic  target  task  is  limited  (Gick  &  Holyoak,  1987, 
pp. 34-37). Other experiments have verified that people behave
differently on isomorphs of the Tower of Hanoi problem (Hayes & Simon,
1977) as well as on different isomorphs of the so-called selection task
(see Evans, 1982, Chap. 9, for a review).

According to the production system hypothesis, these results are to
be expected. Production rules for moving disks between pegs cannot
also  transfer  globes  among  monsters;  production  rules  that  split  up  and 
recombine  X-rays  cannot  also  split  up  and  recombine  army  platoons; 
production  rules  that  decide  whether  envelopes  have  the  proper  postage 
cannot  also  test  abstract  rules;  and  so  on.  Production  rules  contain  vari¬ 
ables,  but  they  quantify  over  arguments  to  predicates,  not  over  predi¬ 
cates.  There  is  no  reason  to  expect  a  production  rule  to  facilitate  the 
construction  of  other  rules  isomorphic  to  itself,  particularly  not  if  the 
intended  isomorphism  is  unknown  to  the  learner. 

The  non-transferability  of  procedural  knowledge  leaves  us  with  a 
picture  of  human  beings  as  brittle  systems  which  can  only  solve  the  very 
tasks that they have practiced. If this is true, then how do we survive
even  a  single  day  of  normal  life? 

The first answer is that the zero transfer prediction must be moderated
by the distinction between far transfer, in which the target task differs
completely from the training task, and near transfer, in which the
two tasks partially overlap. In far transfer situations (which include almost
all instructionally relevant situations) there is no overlap in the
production rules for the two tasks and the production system hypothesis
predicts zero transfer. In the near transfer case, on the other hand, there
are rules in the procedure for the training task which are identical to
rules in the procedure for the target task. In this case, there will be a
transfer effect. Singley and Anderson (1989) claim that the number of
production rules shared between two tasks is a good predictor of the
amount of transfer in near transfer situations.
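
The shared-rule account of near and far transfer can be illustrated with a minimal sketch. It assumes, purely for illustration, that a procedure can be summarized as a set of rule names; the rule names and the function shared_rule_fraction are hypothetical and are not taken from any published model.

```python
def shared_rule_fraction(training_rules, target_rules):
    """Fraction of the target procedure's rules already acquired in training."""
    training, target = set(training_rules), set(target_rules)
    if not target:
        raise ValueError("the target procedure must contain at least one rule")
    return len(training & target) / len(target)


# Hypothetical rule sets: two isomorphs with no rules in common (far transfer)
# and an extended version of the training task (near transfer).
hanoi = {"pick-smallest-disk", "move-disk-to-peg", "check-disk-order"}
monsters = {"pick-smallest-globe", "pass-globe-to-monster", "check-globe-order"}
hanoi_extended = hanoi | {"plan-subgoal"}

far = shared_rule_fraction(hanoi, monsters)          # no overlap
near = shared_rule_fraction(hanoi, hanoi_extended)   # three of four rules shared
print(far, near)
```

On this simplified reading, isomorphic problems count as far transfer whenever their rules mention different predicates, even though their solutions have the same structure.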

The  second  and  more  important  answer  suggested  by  the  present 
theory  is  that  generality  resides  in  a  person's  declarative  knowledge 
rather  than  in  his  or  her  procedural  knowledge.  It  is  our  concepts  and 
beliefs  about  the  world  that  transfer  from  one  situation  to  another,  rather 
than  our  skills.  We  understand  how  the  world  works  well  enough  so 
that  we  are  able  to  construct  the  procedural  knowledge  required  by  novel 
circumstances  and  conditions.  We  cope  by  generating  new  procedures, 
not  by  transferring  old  procedures  to  new  situations. 

This explanation suggests that psychologists have been studying
the wrong paradigm. Transfer studies have focussed on pairs of tasks
which have similar solutions. In the typical transfer experiment, the
experimenter varies the degree of similarity between the solution to a
training task and the solution to a target task and expects the amount of
transfer to vary accordingly. The negative findings from studies of
isomorphic problems command attention because the solutions to those
problems are structurally identical and so ought to yield perfect transfer.

However,  the  hypothesis  that  generality  resides  in  declarative 
knowledge  implies  that  structural  similarity  between  solution  paths  is  ir¬ 
relevant.  The  important  factor  is  whether  two  skills  share  a  common 
conceptual  rationale.  If  the  skills  required  to  solve  two  tasks  A  and  B 
can  both  be  derived  from  a  set  of  beliefs  or  abstract  principles  T,  then 
knowing  T  should  give  the  ability  to  solve  both  A  and  B.  The  fact  that 
two  different  procedures  have  the  same  theoretical  rationale  does  not 
imply  that  there  is  any  formal  or  structural  similarity  between  the  prob¬ 
lem  solutions  generated  by  those  procedures.  For  example,  a  chemical 
analysis  of  an  unknown  compound  and  a  synthesis  of  a  particular  sub¬ 
stance  are  procedurally  different,  but  both  are  based  on  the  same  theory 
of  the  composition  of  matter. 


Simulating Transfer of Training. The skill acquisition literature
contains few studies of procedurally different skills which have the
same declarative rationale, but developmental psychologists have found
a naturally-occurring instance of this type of situation. Gelman and
Gallistel (1978) have argued that children learn to count sets of objects
by deriving the correct counting procedure from their intuitive
understanding of its rationale. They formulated the declarative knowledge
required for correct counting into a set of well-defined counting principles
and presented empirical evidence that children know these principles at
the time they learn how to count. Knowledge of the counting principles
should give the ability to construct not only the procedure for the
standard counting task, but also procedures for two non-standard counting
tasks: to count objects in a particular order, so-called ordered counting,
and to count objects in such a way that a designated object is assigned a
designated number (e.g., "count the objects so that the red object
becomes the fifth one"), so-called constrained counting. The empirical
evidence confirms that children can quickly generate the correct
procedures for these non-standard counting tasks (Gelman & Meck, 1983,
1986; Gelman, Meck, & Merkin, 1986).

To simulate this situation, the HS model was given a problem space
for the task of counting a given set of objects. The representation
included symbols for objects, for numbers, for relations between objects and
numbers, and so on. The actions included to select an object, to select a
number, and to assign a number to an object. Figure 12 summarizes the
problem space for counting.

The  model  was  given  rules  which  knew  how  to  apply  the  opera¬ 
tors,  but  which  did  not  know  how  to  apply  them  correctly.  Finally,  the 
model  was  given  the  counting  principles  in  the  form  of  constraints. 
Figure  13  summarizes  the  counting  simulation.  More  detailed  reports  are 
available  in  Ohlsson  and  Rees  (1991a,  1991b). 
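
The idea of counting principles acting as constraints on otherwise unguided rule application can be sketched as follows. This is a simplified stand-in and not the HS constraint format: it assumes a counting episode can be recorded as a list of (object, number) pairs, and the function names are hypothetical.

```python
from collections import Counter


def one_to_one_violations(assignments):
    """Return the objects and the numbers that were used more than once.

    `assignments` is the list of (object, number) pairs produced so far.
    """
    object_counts = Counter(obj for obj, _ in assignments)
    number_counts = Counter(num for _, num in assignments)
    doubled_objects = sorted(o for o, c in object_counts.items() if c > 1)
    doubled_numbers = sorted(n for n, c in number_counts.items() if c > 1)
    return doubled_objects, doubled_numbers


def stable_order_violated(assignments):
    """True if the numbers were not used in the order 1, 2, 3, ..."""
    numbers = [num for _, num in assignments]
    return numbers != list(range(1, len(numbers) + 1))


correct_count = [("a", 1), ("b", 2), ("c", 3)]
chaotic_count = [("a", 1), ("a", 2), ("b", 2)]

print(one_to_one_violations(correct_count), stable_order_violated(correct_count))
print(one_to_one_violations(chaotic_count), stable_order_violated(chaotic_count))
```

A check of this kind only signals that an error has occurred; revising the rule that produced the error is a separate step carried out by the learning mechanism.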

The  model  was  trained  on  each  of  the  three  procedures  for  standard 
counting,  ordered  counting,  and  constrained  counting.  The  diagonal  of 
Table  1  shows  the  results  as  reported  in  Ohlsson  and  Rees  (1991b).  The 
effort  required  was  measured  in  two  ways,  by  the  number  of  production 
system  cycles  and  by  the  number  of  learning  events,  i.  e.,  rule  revi¬ 
sions.  The  model  learned  each  procedure  in  approximately  the  same 


Representation 

Symbols  for  objects,  numbers,  sets  of  objects,  and  associations 
between  numbers  and  either  objects  or  sets.  Both  objects  and 
numbers can have the properties of being first, current, and point of
origin,  and  numbers  can  have  the  property  of  being  the  answer.  The 
relations  represented  are  correspondency,  set  membership, 
successor,  and  temporal  contiguity. 

Initial  state 

A  set  of  objects  to  be  counted. 

Operators 

Associate  a  number  with  an  object,  associate  a  number  with  a  set, 
select  a  first  object,  select  the  next  object,  select  the  first  number, 
select the next number, and shift focus.

Goal  state 

A  number  designated  as  the  cardinality  of  the  given  set. 

Figure  12.  A  problem  space  for  counting. 

number  of  learning  events.  This  result  illustrates  the  generality  of 
declarative knowledge. A single set of abstract principles gave the model
the ability to construct three different procedures, each procedure being,
in a sense, derived from those principles during practice. The declarative
knowledge  transferred  from  one  counting  task  to  another,  even  though 
the  tasks  are  procedurally  different. 

Switching  between  counting  tasks  is  an  instance  of  near  transfer. 
Not all rules need to be revised. Hence, the theory predicts that there
will  be  procedural  transfer  as  well.  To  illustrate  this,  six  transfer  exper¬ 
iments  were  run  with  the  model.  In  each  experiment,  the  model  first 
practiced  one  of  the  three  counting  tasks  until  it  reached  mastery  and 
then it was switched to either of the other two tasks. The results are

Prior procedural knowledge

The  system  began  with  one  rule  for  each  operator.  That  rule  applied 
the  operator  whenever  possible,  i.  e.,  in  every  situation  in  which  its 
applicability  conditions  were  satisfied.  The  result  was  counting-like 
but chaotic behavior.

Prior  declarative  knowledge 

There were 18 constraints that encode the counting principles as
identified  by  Gelman  and  Gallistel  (1978):  The  one-to-one  mapping 
principle,  the  cardinal  principle,  and  the  stable  order  principle. 

Training 

The  model  was  given  unsupervised  practice  on  sets  of  3-5  objects. 

Learning  outcomes 

The model learned a correct, general procedure for counting any set
of objects, regardless of the size of the set and the type of objects
involved. It also learned correct procedures for two non-standard
counting tasks, ordered counting and constrained counting. Finally,
the model transferred each of the three learned counting procedures to
each of the other two counting tasks.

Figure  13.  Summary  of  the  counting  simulation. 

shown in the off-diagonal cells of Table 1. The model solved each
transfer task successfully. The amount of transfer varied depending on the
exact relations between the rules for the practice task and the rules for
the  target  task.  The  model  also  predicts  asymmetric  transfer  between 
some  tasks.  For  example,  the  transfer  from  ordered  to  constrained 
counting  was  0%,  while  the  transfer  from  constrained  to  ordered 
counting  was  75%.  These  predictions  are,  in  principle,  empirically 
testable,  although  the  necessary  data  are  not  available  at  this  time. 


Table 1. The computational effort required by the HS model to learn
each of three counting tasks (diagonal cells) and to solve each of six
different transfer tasks (off-diagonal cells).ᵃ

                                      Transfer task
                          ------------------------------------
Training task             Standard    Ordered     Constrained
                          counting    counting    counting

Standard counting
  Rule revisions              12           2           2
  Prod. sys. cycles          854         110         127

Ordered counting
  Rule revisions               1          11          11
  Prod. sys. cycles          184         262         297

Constrained counting
  Rule revisions               0           3          12
  Prod. sys. cycles          162         154         451

ᵃData taken from Ohlsson and Rees (1991b, Tables 1 and 2).


Contrary  to  Singley  and  Anderson  (1989),  the  present  theory  does 
not  imply  that  the  amount  of  transfer  is  predictable  from  the  number  of 
overlapping  production  rules.  Instead,  the  variable  of  interest  is  the 
amount  of  cognitive  work  that  has  to  be  performed  in  order  to  adapt  the 
rules to the target task. A single rule from the training task might cause
more than one type of error in the target task and need to be revised
more than once, so the number of rules that need to be revised is probably
too coarse a predictor variable. The present analysis suggests that
the  number  of  rule  revisions  is  a  better  predictor. 

Unlike the number of overlapping production rules, the number of
rule revisions required to master the target task cannot be calculated
from a static comparison of the two procedures. It is a function of the
particular learning mechanism which carries out the revisions. To verify
this, the knowledge compilation mechanism, a cornerstone of the ACT
model described by Anderson (1983), was implemented within the HS
architecture. According to the knowledge compilation hypothesis,
declarative knowledge resides in long-term memory in a format similar
to inference rules. Familiar problems are solved with production rules,
but unfamiliar problems are solved by interpreting (in the computer
science sense) the declarative knowledge. During interpretation, new
production rules are constructed, which eliminates the need to re-interpret
the declarative knowledge on subsequent trials. Once the rules are
constructed, they are composed into larger rules which solve the relevant
task more efficiently. Unlike HS, the ACT model learns from
successes, not from errors.

We did not implement the ACT architecture as described in
Anderson (1983). Instead, we implemented the knowledge compilation
mechanism within the HS architecture. The result was a version of HS
which learns through knowledge compilation instead of through
constraint violations. All other aspects of the HS architecture were kept the
same. I shall refer to the HS architecture as the KC model when it learns
through knowledge compilation. The upshot is that we have two
simulation models, HS and KC, which learn in different ways but which are
otherwise identical. This provides an opportunity to compare the
behavioral predictions of the two learning mechanisms.

KC was given the counting principles in the form of declarative
knowledge and was then run through the same set of learning
experiments and transfer experiments as HS. Because the effort measures
differed by an order of magnitude (KC was on the average ten times
slower than HS), they have been converted into transfer scores. There
are many different ways to measure transfer (Singley & Anderson,
1989, pp. 37-41). The index used here was


$$T = \frac{E_B - E_{AB}}{E_B} \times 100 \qquad (2)$$

where $E_B$ is the cognitive effort required to master the target task B from
scratch and $E_{AB}$ is the effort required to master B given previous
mastery of training task A. The T index can be interpreted as the proportion
of the effort required to learn task B that is saved by first learning task
A. It is equal to zero when practice on the training task A is of no help
and it is equal to 100 when practice on the training task provides
mastery of B with no further training. The index is negative if practice on
task A increases the amount of effort required to master B.
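
As a worked illustration, the index can be computed directly from the HS effort measures in Table 1. The sketch below assumes the reading of Table 1 in which rows are training tasks and columns are transfer tasks; the function and variable names are hypothetical.

```python
def transfer_index(effort_from_scratch, effort_after_training):
    """Percentage of the effort to learn the target task saved by prior training."""
    if effort_from_scratch <= 0:
        raise ValueError("effort from scratch must be positive")
    saved = effort_from_scratch - effort_after_training
    return 100.0 * saved / effort_from_scratch


TASKS = ("standard", "ordered", "constrained")

# HS effort values from Table 1, indexed as effort[training task][target task].
RULE_REVISIONS = {
    "standard": {"standard": 12, "ordered": 2, "constrained": 2},
    "ordered": {"standard": 1, "ordered": 11, "constrained": 11},
    "constrained": {"standard": 0, "ordered": 3, "constrained": 12},
}
CYCLES = {
    "standard": {"standard": 854, "ordered": 110, "constrained": 127},
    "ordered": {"standard": 184, "ordered": 262, "constrained": 297},
    "constrained": {"standard": 162, "ordered": 154, "constrained": 451},
}


def transfer_scores(effort):
    """Rounded transfer index for every (training, target) pair of distinct tasks."""
    scores = {}
    for training in TASKS:
        for target in TASKS:
            if training != target:
                from_scratch = effort[target][target]   # diagonal cell
                after = effort[training][target]        # off-diagonal cell
                scores[(training, target)] = round(transfer_index(from_scratch, after))
    return scores


revision_scores = transfer_scores(RULE_REVISIONS)
cycle_scores = transfer_scores(CYCLES)
for pair in revision_scores:
    print(pair, revision_scores[pair], cycle_scores[pair])
```

Under these assumptions the cycle-based scores agree with the HS column of Table 2 for all six transfer tasks, and the revision-based scores agree for five of them. The constrained-to-ordered revision score computes to about 73 rather than the printed 67, so the published value may rest on details that cannot be recovered from Table 1 alone.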

The  effort  measures  for  the  transfer  experiments  with  the  HS  and 
KC  models  were  converted  to  transfer  scores.  The  results  are  shown  in 
Table  2.  The  amount  of  transfer  predicted  by  the  HS  model  varied  be¬ 
tween  8  and  100  across  tasks,  while  the  transfer  predicted  by  the  KC 


Table 2. Transfer scores for the HS and KCᵃ models for each of six
transfer tasks in the counting domain, based on both the number of rule
revisions and the number of production system cycles.

                                    Effort measure
                          ------------------------------------
Transfer                  Rule revisions     Prod. sys. cycles
from-to                     HS      KC          HS      KC

Standard-Ordered            82     100          58     100
Standard-Constrained        83      72          72      63
Ordered-Standard            92      34          78      19
Ordered-Constrained          8      34          34      29
Constrained-Standard       100      97          81      81
Constrained-Ordered         67      97          41      73

ᵃKC is an acronym for knowledge compilation.


model  varied  between  19  and  100.  More  importantly  for  present  pur¬ 
poses,  the  two  models  made  different  transfer  predictions  for  one  and 
the  same  task.  HS  predicts  a  score  of  92  for  the  transfer  from  ordered  to 
standard  counting,  while  the  corresponding  KC  prediction  is  34.  More 
important  still,  the  differences  between  tasks  do  not  always  go  in  the 
same  direction  for  the  two  models.  HS  predicts  that  transfer  from  stan¬ 
dard  to  ordered  counting  is  easier  than  vice  versa,  while  KC  predicts  the 
opposite.  The  results  in  Table  2  verify  the  fact  that  the  amount  of  near 
transfer  between  two  tasks  cannot  be  predicted  from  a  static  analysis  of 
the  procedures  for  those  tasks,  but  depends  upon  assumptions  about 
learning. 


APPLICATION  TO  INSTRUCTION 

A  good  theory  should  have  implications  for  practice.  The  natural 
application  domain  for  a  learning  theory  is  the  design  of  instruction.  The 
instructional implications of the present theory include an explanation of
why  it  is  possible  to  learn  from  instruction,  a  rationale  for  the  most 
common  tutoring  scenario,  a  prescription  for  effective  tutoring  mes¬ 
sages,  and  a  technology  for  evaluating  instructional  designs  through 
simulated  one-on-one  tutoring. 

Why  Instruction  is  Possible 

Why are people able to learn from instruction? Although the origin
of cognitive capacities such as language and learning is almost
completely unknown, it is likely that learning evolved before language.
There are no mammalian species, and probably no lower organisms,
which cannot learn, so the capacity to learn was almost certainly present
in the hominids when they separated from the rest of the primates 4-10
million years ago.

Language, on the other hand, evolved later, perhaps very much
later. McCrone (1992) summarizes the fossil evidence in the following
way: "The high arch in the roof of the mouth that helps with voice
production is about the only telltale sign of speech that shows up on a fossil
skeleton. This arch did not start to appear until Homo erectus arrived
about 1.5 million years ago, and even then the arch was slight. Judging
from fossils, modern speech came along about 100,000 years ago when
the earliest examples of Homo sapiens were starting to walk the earth."
(pp. 160-161)³ One hundred thousand years is a short time in
evolutionary terms. If this estimate is correct, then special-purpose brain
mechanisms for learning from verbal instruction have had little time to evolve.

These two speculative but plausible hypotheses (that learning
preceded language and that language is too recent for special-purpose
brain mechanisms for instruction to have evolved) imply that our ability
to learn without instruction is primary and our ability to learn from
instruction secondary and parasitic upon the former. A theory of learning
from instruction should therefore explain how instruction feeds into
learning mechanisms that evolved for the purpose of uninstructed
learning.

The theory proposed in this chapter suggests such an explanation.
The two functions of detecting and correcting errors can be computed by
noticing contradictions and by inferring the conditions that produced
them as described previously, but they can also be computed in other
ways. Instruction works, the theory suggests, because being told that
one has committed an error is functionally equivalent to detecting the
error oneself and because being told the cause of an error is functionally
equivalent to figuring out the cause oneself. Learning from instruction is
possible because instructional messages enter into the learning process
in the same way, functionally speaking, as declarative knowledge
retrieved from long-term memory.

A  Rationale  for  One-on-One  Tutoring 

The theory proposed in this chapter implies that there are three
major felicity conditions (VanLehn, 1990, p. 23) for effective instruction in
cognitive skills: (a) instruction should be offered during ongoing
practice, (b) instruction should alert the learner to errors, and (c) instruction
should identify the conditions which caused the error. The type of
instruction that satisfies these three conditions is entirely familiar. In
one-on-one tutoring, the teacher watches as the learner practices, points out
errors, and helps the learner correct them. The present theory selects as
most felicitous precisely that type of instruction which the empirical data
show is most effective (Bloom, 1984).

³See Lyons (1988, p. 153) for a different interpretation of the evidence.

Intelligent tutoring systems are typically designed to teach cognitive
skills (Psotka, Massey, & Mutter, 1988; Sleeman & Brown, 1982).
Perhaps the most successful line of intelligent tutoring systems is the
so-called model tracing tutors developed by John Anderson and
co-workers (Anderson et al., 1987, 1990). Skill training tutors in general
and model tracing tutors in particular conform closely to the three felicity
conditions: They give feedback in the context of practice, they alert the
learner to errors, and they help the learner to correct the error.

The design of the model tracing tutors is said to be derived from the
ACT theory of learning (Anderson et al., 1987). However, none of the
six learning mechanisms described in various versions of the ACT
theory (analogical transfer, discrimination, generalization,
proceduralization, rule composition, and strengthening) can take a tutoring message
as input and revise a faulty production rule accordingly. Analogical
transfer generates task relevant activity by relating the current problem to
an already solved problem; rule composition creates more efficient rules
by combining existing (hopefully correct) rules; strengthening increases
the probability that a (hopefully correct) rule will be retrieved. These
three learning mechanisms can neither take a tutoring message as input
nor revise an existing rule. Generalization and discrimination (which do
not loom large in expositions of the ACT theory) revise existing rules,
but cannot take a tutoring message as input. Proceduralization generates
new rules on the basis of verbal input, but cannot correct existing rules.
Taken literally, the ACT theory predicts that it is impossible to learn
from the teaching scenario embodied in the model tracing tutors. Unless
people can learn in other ways than those described in the ACT theory,
they have no cognitive mechanisms for learning from feedback
messages about errors.

To highlight the contrast between the implications of the ACT
theory and the design of the model tracing tutors, consider what an
intelligent tutoring system derived from the ACT theory might be like. In
order to facilitate analogical transfer, such a system might keep a record of
the problems the student has solved in the past and suggest possible
analogies when the student hesitates. Such a system might repeat the
task instructions from time to time to give the student a chance to
re-proceduralize them. It might sequence practice problems in such a way that
rule composition, generalization, and discrimination are facilitated.
Finally, it might provide opportunities to exercise already acquired
components of the target skill in order to increase their strengths. However,
a tutoring system derived from the ACT theory would have no reason to
alert the learner to errors and give help in correcting them.

The model tracing tutors and most other skill training systems
conform to the design that follows from the theory presented in this chapter:
They help the learner detect and correct errors. The instructional success
of such tutors provides support for the hypothesis that error correction is
the natural modus operandi of skill acquisition. If it were not, those
tutors would not be effective, but empirical evaluations show that they are
(Anderson et al., 1990, pp. 30-33). In short, the present theory
provides a rationale for the teaching scenario adopted by designers of
intelligent tutoring systems and in turn receives empirical support from the
instructional success of such systems.

Deriving the Content of Instruction from Theory

The content of feedback messages is the Achilles heel of
skill-monitoring tutoring systems. Delivering feedback messages is the major
instructional action of such a system, so its instructional effectiveness
depends crucially on the content of those messages. Until now there has
been no theory for how to formulate feedback messages. Such
messages are typically pre-formulated texts and they are written in the same
way as other instructional materials: The instructional designer makes a
guess about what might work based on his or her understanding of the
subject matter. In spite of the strong claims about the tight relation
between the ACT theory and the model-tracing tutors (Anderson et al.,
1987), this is true of those tutors as well. No existing tutoring system
derives the content of its tutoring messages from assumptions about
learning.

The  learning  theory  proposed  here  implies  that  tutoring  messages 
should help the student identify those properties of the current problem
state  which  indicate  that  an  error  has  been  committed,  so  that  he  or  she 
can  detect  his  or  her  errors  without  help  in  the  future.  The  general  form 
for  this  type  of  tutoring  message  is  "you  can  tell  that  you  just  made  an 
error,  because  of  P",  where  P  is  some  conjunction  of  easily  accessed 
properties  of  the  problem  state  produced  by  the  erroneous  action. 
Unless  the  learner  can  detect  his  or  her  errors,  he  or  she  cannot  learn 
from  them. 

More importantly, tutoring messages should help the learner correct
his or her errors. To do so, a message must identify those properties of
the immediately preceding problem state that constitute
counterindications to the problem solving step that the student took. A problem
solving step A is typically correct in some situations but wrong in others.
The task of the student is to figure out when, i.e., in which situations,
doing A is right and when it is wrong. If doing A in situation S is
incorrect, then the corresponding tutoring message should have the general
form "when such-and-such conditions are the case, A is not the right
thing to do". The conditions mentioned in the message should refer to
the immediately preceding problem situation, not to the situation in
which the error was discovered. The student needs to learn to avoid the
error, i.e., to act differently in the situation in which he or she decided
to do A. The tutoring system should therefore back up and explain what
makes A the wrong choice in that situation.

These prescriptions rule out some types of feedback messages
which are commonly used in tutoring systems. For example, it is
intuitively plausible that if a student takes step A when he or she should
have taken step B, then it helps to print a message of the form "you did
A but you should have done B." According to the present theory,
however, this type of feedback message is likely to be ineffective, because it
does not specify the conditions under which either A or B should or
should not be done. Instruction should focus on the conditions of
actions, not on the actions themselves. A second common type of
feedback message explains what is wrong with the situation in which the
error was discovered, i.e., why the error is an error. This might increase
the student's understanding of the domain but it is unlikely to help him
or her acquire the target skill, because it does not tell him or her how to
avoid the error. A feedback message should focus on the situation in
which the student decided to do A, not on the situation produced by
doing A.
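
The two message forms prescribed in this section can be made concrete with a small sketch. It assumes that a tutoring system can describe an error episode by the action taken, the properties of the state in which the action was chosen, and the properties of the state the action produced. The class ErrorEpisode, the two functions, and the subtraction wording are all hypothetical illustrations, not part of any existing tutor.

```python
from dataclasses import dataclass


@dataclass
class ErrorEpisode:
    action: str                # the step the student took (A)
    preceding_conditions: list  # properties of the state in which A was chosen
    resulting_symptoms: list    # properties of the state produced by A (P)


def detection_message(episode):
    """Message that helps the learner notice the error in the resulting state."""
    symptoms = " and ".join(episode.resulting_symptoms)
    return f"You can tell that you just made an error, because {symptoms}."


def correction_message(episode):
    """Message that names the conditions under which the action is wrong."""
    conditions = " and ".join(episode.preceding_conditions)
    return f"When {conditions}, {episode.action} is not the right thing to do."


episode = ErrorEpisode(
    action="subtracting the smaller digit from the larger one",
    preceding_conditions=["the top digit in the column is smaller than the bottom digit"],
    resulting_symptoms=["adding your answer to the bottom number does not give back the top number"],
)

print(detection_message(episode))
print(correction_message(episode))
```

The correction message is built only from the preceding conditions, which reflects the prescription that feedback should describe the situation in which the student decided to act rather than the situation produced by the action.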

Simulating  One-on-One  Tutoring 

The HS model can be interpreted as a model of learning from
tutoring, with the constraints playing the role of tutoring messages. It is a
matter of interpretation whether the constraints correspond to knowledge
items retrieved from memory, conclusions from inference chains, or
tutoring messages received through the language comprehension
channel. According to the theory proposed here, these three types of
knowledge elements enter into the learning process in the same way.

A runnable simulation of learning from tutoring opens up novel
possibilities. We can evaluate alternative instructional designs by
teaching them to the model and measuring the amount of computational work
it has to expend to learn the target skill under different circumstances. If
the model can reach mastery with less work under one instructional
design or tutoring regime than another, then that is evidence that the
former is the better design.

To explore this possibility the HS model was tutored in subtraction.
The simulation experiment followed the common classroom tactic of
first teaching the procedure for canonical subtraction problems, i.e.,
problems in which each subtrahend digit is smaller than the minuend digit in
the same column, and then introducing the procedure for handling
non-canonical columns, i.e., columns in which the subtrahend digit is larger
than the minuend digit, once the procedure for canonical subtraction has
been mastered (Leinhardt, 1987; Leinhardt & Ohlsson, 1990).
Accordingly, the model was first given a procedure for canonical
subtraction and was then tutored in canonicalization.

Two different HS models of canonical subtraction were implemented.
One model, called the high knowledge model, was built around a
representation of the place value meaning of digits. In this
representation the digit 3 was represented as (3 * 10) if it appeared in
the second column from the right, as (3 * 100) if it appeared in the third
column, and so on. The operations by which this representation was
manipulated correspond to mathematically motivated operations on numbers.
The high knowledge model was intended to simulate skill acquisition in the
context of conceptual understanding of place value.

The second model, called the low knowledge model, was built around a
representation in which a subtraction problem is a two-dimensional array
of digits. In this representation, the digit 3 was represented as the
digit 3 regardless of its position in the problem display. The operations
by which this representation was manipulated correspond to physical
operations on digits rather than conceptual or mathematical operations on
numbers. The low knowledge model was intended to simulate rote learning
of subtraction. Figure 14 summarizes the problem space for subtraction.
The reader is referred to Ohlsson, Ernst, and Rees (1992) for a full
account.
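
The contrast between the two representations can be illustrated with a
minimal Python sketch. It is not the HS encoding, which is documented in
Ohlsson, Ernst, and Rees (1992); the function names and the tuple layout
are assumptions made for this illustration only.

```python
def digit_array(minuend: str, subtrahend: str) -> list[list[str]]:
    """Low knowledge view (illustrative): a two-row array of bare digit
    symbols. A '3' is the same symbol whatever column it sits in."""
    width = max(len(minuend), len(subtrahend))
    return [list(minuend.rjust(width)), list(subtrahend.rjust(width))]


def place_values(numeral: str) -> list[tuple[int, int]]:
    """High knowledge view (illustrative): each digit is paired with the
    place value of its column, e.g. the 3 in 735 becomes (3, 10)."""
    n = len(numeral)
    return [(int(d), 10 ** (n - 1 - i)) for i, d in enumerate(numeral)]


def number_denoted(pairs: list[tuple[int, int]]) -> int:
    """Only the place value encoding determines the number denoted."""
    return sum(digit * place for digit, place in pairs)


low = digit_array("735", "42")    # [['7', '3', '5'], [' ', '4', '2']]
high = place_values("735")        # [(7, 100), (3, 10), (5, 1)]
```

The place value encoding carries strictly more structure per digit, which
is the sense in which the high knowledge model has more to create and
update on every step.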

Both the high and the low knowledge models were tutored in the
regrouping algorithm taught in American schools. In this method
non-canonical columns are handled by incrementing the minuend of the
non-canonical column and performing a corresponding decrement on the
minuend in the next column to the left. Both models were also tutored in
the equal addition algorithm taught in some European schools. In this
method non-canonical columns are handled by incrementing the minuend in
the non-canonical column and incrementing the subtrahend in the next
column to the left. The simulation experiment thus followed a 2-by-2
design, with two levels of knowledge paired with two different target
skills.
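
The two target skills can be compared side by side in a short Python
sketch. It works on plain digit lists, not on HS production rules, and it
assumes that the minuend is at least as large as the subtrahend and that
both are non-negative. All function names are hypothetical.

```python
def to_digits(n: int, width: int) -> list[int]:
    """Digits of n, rightmost column first, padded with zeroes to width."""
    return [(n // 10 ** k) % 10 for k in range(width)]


def from_digits(digits: list[int]) -> int:
    return sum(d * 10 ** k for k, d in enumerate(digits))


def regrouping(minuend: int, subtrahend: int) -> int:
    """Add ten to the minuend digit, take one from the minuend to the left."""
    width = len(str(minuend))
    top, bottom = to_digits(minuend, width), to_digits(subtrahend, width)
    answer = []
    for col in range(width):
        if top[col] < bottom[col]:             # non-canonical column
            left = col + 1
            while top[left] == 0:              # skip over blocking zeroes
                left += 1
            top[left] -= 1
            for k in range(left - 1, col, -1):
                top[k] = 9                     # each blocking zero becomes 9
            top[col] += 10
        answer.append(top[col] - bottom[col])
    return from_digits(answer)


def equal_addition(minuend: int, subtrahend: int) -> int:
    """Add ten to the minuend digit, add one to the subtrahend to the left."""
    width = len(str(minuend))
    top, bottom = to_digits(minuend, width), to_digits(subtrahend, width)
    answer = []
    for col in range(width):
        if top[col] < bottom[col]:             # non-canonical column
            top[col] += 10
            bottom[col + 1] += 1               # same step with or without zeroes
        answer.append(top[col] - bottom[col])
    return from_digits(answer)


print(regrouping(503, 246), equal_addition(503, 246))   # 257 257
```

In this sketch the regrouping function needs an extra search loop and an
extra rewrite loop to cope with a zero in the minuend, whereas the equal
addition function treats every non-canonical column identically.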

The procedures for tutoring the model were similar to those involved
in tutoring a human student. The programmer in charge of the system
watched while the model tried to solve a non-canonical problem, spotted
errors, halted the model, and typed in a constraint (tutoring message)
intended to correct the observed error. When the model had attained
mastery, it was reinitialized and run again with all the constraints in
place simultaneously, to verify that they were indeed sufficient to
produce correct performance. This tutoring scenario was carried out four
times, once for each combination of knowledge level and target skill.
The amount of computational work required to attain mastery in each
condition was recorded. Figure 15 summarizes the subtraction simulation.

Representations

Two different representations for subtraction were created. (a) The
procedural or low knowledge representation contained symbols for
written digits, perceived digits, spatial locations, scratch marks,
decrements, and increments. (b) The conceptual or high knowledge
representation contained, in addition, symbols for subtraction
problems, numbers, place values, links between numbers and digits,
relations between numbers, and answers.

Initial state

A subtraction problem.

Operators

Look at a digit, move the eye to another digit, write a digit, cross out
a digit, assert the answer, recall a number fact, create a working
memory schema, and revise a working memory schema.

Goal state

A number designated as the answer to the subtraction problem.

Figure 14. A problem space for subtraction.

Prior procedural knowledge

The system knew at the outset how to solve a canonical subtraction
problem, i.e., a problem in which the subtrahend digit is smaller
than the minuend digit in every column.

Training

The model was tutored in how to handle non-canonical problems. It
executed its procedure for canonical problems until it made an error. It
was then halted and given a constraint that was intended to allow it to
correct the error.

Learning outcome

The model learned two different procedures for non-canonical
columns, namely regrouping and equal addition, with both the low
knowledge and the high knowledge representations.

Figure 15. Summary of the subtraction simulation.

Table 3 shows the number of production system cycles and the
number of rule revisions required for HS to attain mastery in each of the
four conditions. There are two main results. The high knowledge model
required more work to attain mastery than the low knowledge model.
This is true for both canonicalization procedures and for both effort
measures. The reason for this result is that the high knowledge model
had a more elaborate representation. It requires more cognitive
operations to create and update a more elaborate representation, and each
operation must be guided by some production rule. Hence, the high
knowledge model had more to learn.

The second result is that learning the regrouping procedure required
more work than learning the equal addition procedure in both the high
knowledge and the low knowledge conditions. The reason for this result
is that the control of the regrouping procedure becomes complicated
when it is necessary to regroup the minuend recursively to handle
blocking zeroes, i.e., minuend zeroes immediately to the left of a
non-canonical column. The augmenting (equal addition) procedure is not
affected by the number of blocking zeroes. The regrouping procedure also
requires more complicated visual attention allocation.

Table 3. The computational effort required for the HS model to master
regrouping and augmenting with either a high knowledge or a low
knowledge representation. (a)

                                   Type of representation
                            High knowledge         Low knowledge
Method learned             Cycles  Revisions     Cycles  Revisions

Regrouping
  W/o blocking zeroes (b)     940         23        449         16
  With blocking zeroes       1815         32        794         24
Augmenting
  W/o blocking zeroes         862         20        687         18
  With blocking zeroes        862         20        687         18

(a) Data taken from Ohlsson, Ernst, and Rees (1992, Table 2).
(b) I.e., minuend zeroes to the left of a non-canonical column.
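
The comparisons in Table 3 can be checked mechanically. The Python sketch
below transcribes the table and computes, for each row, how much more
work the high knowledge model needed than the low knowledge model. The
dictionary layout and the function name are assumptions made for this
illustration; the numbers are those of Table 3.

```python
# (cycles, revisions) per condition, transcribed from Table 3.
TABLE_3 = {
    ("regrouping", "without"): {"high": (940, 23), "low": (449, 16)},
    ("regrouping", "with"):    {"high": (1815, 32), "low": (794, 24)},
    ("augmenting", "without"): {"high": (862, 20), "low": (687, 18)},
    ("augmenting", "with"):    {"high": (862, 20), "low": (687, 18)},
}


def high_to_low_ratio(method: str, zeroes: str) -> tuple[float, ...]:
    """Ratio of high to low knowledge effort, as (cycles, revisions)."""
    row = TABLE_3[(method, zeroes)]
    return tuple(round(h / l, 2) for h, l in zip(row["high"], row["low"]))


for (method, zeroes), row in TABLE_3.items():
    print(method, zeroes, high_to_low_ratio(method, zeroes))

# Example: regrouping with blocking zeroes
# high_to_low_ratio("regrouping", "with")  ->  (2.29, 1.33)
```

Every ratio is greater than one, which restates the first result: the
high knowledge model needed more cycles and more revisions in all four
rows.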

Both of these results were unexpected because they contradict the
common belief among mathematics educators that regrouping is easier to
learn, particularly in the high knowledge condition. This belief is based
on empirical investigations carried out in the pre-World War II era
(Brownell, 1947; Brownell & Moser, 1949). A detailed discussion of
these simulation results and their relation to the empirical research has
been presented elsewhere (Ohlsson, 1992a).

This simulation exercise demonstrates that runnable models of
learning from instruction create new relations between learning theory
and instructional design (Ohlsson, 1992a). Instead of deriving general
design principles from the learning theory, as suggested by Bruner
(1966) and later by Glaser (1976, 1982), we can evaluate an
instructional design directly by teaching it to a simulation model. This
technology has the potential to allow instructional designers to do
formative evaluation without leaving their desks (Ohlsson, 1992b).

This simulation exercise also demonstrates that the application of
learning theory to education requires a formal analysis of instruction. A
model of learning cannot have implications for instruction unless it
contains learning mechanisms which take instruction, suitably formalized,
as one of their inputs. Computational analysis of instruction has
barely begun. Some early Artificial Intelligence systems explored how a
system can learn from advice and instructions (Hayes-Roth, Klahr, &
Mostow, 1981; Mostow, 1983; Rychener, 1983), but the problem appears
to have disappeared from the research agenda of the machine
learning community. The proceduralization mechanism in the ACT
model (Anderson, 1983) was a first attempt to formalize this problem in
a psychological context. Although the proceduralization mechanism
explains how the learner constructs new rules on the basis of task
instructions, it does not explain how the learner revises existing rules
on the basis of feedback messages. The Sierra and Cascade models described
by VanLehn (1990) and VanLehn and Jones (this volume) learn from
solved examples (a common form of instruction), but they cannot take
tutoring messages as inputs. The HS model constitutes a modest first
step towards a formal theory of how tutoring messages received during
skill practice are translated into mental code for the target skill.


SUMMARY  AND  CONCLUSIONS 

The theory proposed in this chapter is formulated in terms of
cognitive functions instead of information processing mechanisms. The
function of learning to solve an unfamiliar task is analyzed into two
subfunctions: to generate learning opportunities and to construct new
procedural knowledge. The latter function is in turn broken down into two
subfunctions: to detect incorrect problem solving steps and to correct
the procedural knowledge that generated them. To detect errors requires
a comparison of the current problem state with prior knowledge.
To correct an error, finally, involves identifying the conditions under
which the error appears and constraining the faulty decision rule
accordingly. The main claim of the theory is that this is the right
functional breakdown of skill acquisition.

How does this theory explain the role of prior knowledge in skill
acquisition? Domain knowledge is not needed to generate task relevant
actions. Weak methods can generate behavior even in the absence of any
knowledge about the task. The functions for which knowledge is
needed are to detect and correct errors. Incorrect or incomplete task
procedures are likely to produce situations which contradict what ought
to be true in the particular domain. Domain knowledge increases the
probability that the learner recognizes that he or she has made an error.
To correct the error presupposes the ability to identify the conditions
under which that error will appear. This might require complicated
reasoning about the domain. Prior knowledge increases the probability
that the learner identifies the causes of errors correctly.

Each of the functions postulated in the theory can be computed by
many different information processing mechanisms. In the particular
implementation of the theory described in this chapter, task relevant
behavior is generated by forward search through the problem space.
Errors are detected by matching constraints against problem states with a
pattern matcher. The conditions that produced a particular error are
identified by regressing the match between a constraint and a state
through a production rule. Errors are corrected by adding the conditions
that produce them to the left-hand sides of decision rules. Other
implementations of the functional theory are possible.
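
The detect-and-correct cycle described above can be sketched in a few
lines of Python. This is a deliberately simplified stand-in, not the HS
implementation: a constraint is modelled as a pair of tests (a relevance
test and a satisfaction test), and regression through the production rule
is replaced by a one-step look-ahead condition. Both choices, the
single-column toy domain, and every name in the block are assumptions of
this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, NamedTuple, Optional


class Column(NamedTuple):
    """Toy problem state: one subtraction column."""
    top: int
    bottom: int
    answer: Optional[int] = None


@dataclass(frozen=True)
class Constraint:
    relevance: Callable[[Column], bool]      # when does the constraint apply?
    satisfaction: Callable[[Column], bool]   # what must then be true?

    def violated_by(self, state: Column) -> bool:
        return self.relevance(state) and not self.satisfaction(state)


@dataclass
class Rule:
    action: Callable[[Column], Column]
    conditions: list = field(default_factory=list)   # left-hand side

    def applies(self, state: Column) -> bool:
        return all(cond(state) for cond in self.conditions)


def practice(rule: Rule, constraints: list, state: Column) -> bool:
    """Fire the rule once; on a constraint violation, add a condition to
    its left-hand side. Returns True if the rule was revised."""
    if not rule.applies(state):
        return False
    outcome = rule.action(state)
    for c in constraints:
        if c.violated_by(outcome):           # error detected
            # Stand-in for regression: block states whose outcome violates c.
            rule.conditions.append(
                lambda s, c=c: not c.violated_by(rule.action(s)))
            return True
    return False


subtract = Rule(action=lambda s: s._replace(answer=s.top - s.bottom))
non_negative = Constraint(relevance=lambda s: s.answer is not None,
                          satisfaction=lambda s: s.answer >= 0)

revised = practice(subtract, [non_negative], Column(top=3, bottom=6))
print(revised, subtract.applies(Column(3, 6)), subtract.applies(Column(7, 2)))
```

After one violation the overly general rule no longer fires in
non-canonical columns but still fires in canonical ones, which is the
kind of rule specialization the text describes.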

The simulation model generates several quantitative predictions
about two classical problems in learning theory. First, it predicts that
skill acquisition is negatively accelerated. More precisely, it predicts
that the learning curve follows a so-called power law. Second, the theory
predicts zero transfer in far transfer situations. It also predicts that
the amount of transfer in near transfer situations depends upon the
particular tasks involved and that transfer might be asymmetrical, i.e.,
that there might be more transfer from task B to task A than from task A
to task B. Finally, the model predicts that the richer the representation
of the task to be learned, the more cognitive effort is needed to attain
mastery.
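
For readers unfamiliar with the term, a power law learning curve has the
general form effort = a * trial ** (-b). The Python sketch below shows
what negative acceleration means for such a curve. The parameter values
are arbitrary illustrations and are not results from the HS simulations.

```python
def power_law(trial: int, first_trial_effort: float = 100.0,
              rate: float = 0.5) -> float:
    """Effort on a given practice trial under effort = a * trial ** (-b).
    Both parameters are made-up values chosen for illustration."""
    return first_trial_effort * trial ** -rate


efforts = [power_law(n) for n in range(1, 7)]

# Improvement from each trial to the next; it shrinks as practice goes on.
gains = [earlier - later for earlier, later in zip(efforts, efforts[1:])]

for n, effort in enumerate(efforts, start=1):
    print(f"trial {n}: effort {effort:6.2f}")
```

Each successive gain is smaller than the one before, so most of the
improvement comes early in practice.
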

With respect to the practical problem of designing computer-based
instruction in cognitive skills, the present theory provides a rationale
and an explanation for the effectiveness of one-on-one tutoring, the main
teaching scenario embodied in current intelligent tutoring systems.
Tutoring (by computer or by human) works, the theory claims, because
tutoring messages provide an alternative way to become aware of
errors and an alternative source of information about the conditions
under which the errors occur.

According to the present theory, tutoring messages can help the
learner in two ways. First, to help the learner detect his or her own
errors, tutoring messages should point out those properties of a problem
state which indicate that an error has occurred. Second, to help the
learner correct his or her errors, tutoring messages should identify those
properties of a problem state which indicate that an error will occur if
such and such an action is executed.

The theory proposed here is obviously incomplete. People
undoubtedly learn from their errors, but they also learn from their
successes. The theory needs to be extended with assumptions about how
people learn from correct problem solving steps. It is not clear which of
the model's predictions will remain constant if the model is augmented
with additional learning mechanisms. The interaction between multiple
learning mechanisms is a high-priority issue for computational learning
theories. In past work, I combined a method for learning from error
(discrimination) with two methods for learning from success
(generalization and subgoaling). The resulting model learned to solve
simple puzzle tasks (Ohlsson, 1987a), but it threw no light on the
problem of prior knowledge.

The problem of how prior knowledge impacts learning is central for
the study of skill acquisition. The outcome of practice is always a
function of both the learner's prior knowledge about the domain and the
new information that becomes available during practice. Any viable
learning theory must describe the cognitive mechanism that interfaces
those two knowledge sources. The fate of the theory proposed here will
ultimately be determined by comparative evaluations with alternative
computational theories of the function of prior knowledge in learning,
once such alternative theories become available.